ENH: DAOS and DFS modules #1014

shanedsnyder · 2024-10-31T19:25:36Z

This PR adds new instrumentation of DAOS storage APIs and corresponding updates to our analysis tools to integrate this DAOS data. Specifically, 2 new Darshan modules are defined: DARSHAN_DFS_MOD for instrumenting usage of the DAOS file system (DFS) API and DARSHAN_DAOS_MOD for instrumenting native DAOS object APIs. More details on each module below.

DFS module:

For each DFS file, Darshan captures a fixed set of integer/FP counters (see full list in dfs-log-format.h) and the corresponding DAOS pool/container UUIDs.
DFS file record names are based on the full path in the DFS directory tree, similar to our other file-based modules.
DFS file record IDs are based off of the underlying DAOS OID, not the file name.
- This approach was chosen because not all DFS file open routines take a file name as input (e.g., dfs_obj_global2local()), so not all processes have the file name available to generate a consistent record ID -- using the object OID allows all processes to agree on a consistent record ID value.
  - See issue BUG: Allow names for shared records to come from non-zero ranks #1026, which tracks how to properly agree on a shared record file name in cases when rank 0 does not have the name.
- One side effect of this approach worth mentioning is that, since Darshan records are based on underlying OIDs and not file names, deleting/recreating files will result in multiple Darshan records corresponding to the same file -- this behavior can be easily observed in benchmarks like IOR which delete/recreate the output file on each iteration. It will ultimately be the responsibility of analysis tools to aggregate file records in this case.
Asynchronous I/O capture is fully supported in Darshan instrumentation wrappers for the DFS interface.
The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.
- Note that for applications using dfs_connect() (rather than dfs_mount()), Darshan has no way to obtain the pool and container UUIDs, and indicates that the combo is "UNKNOWN" in parsing tools.
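The aggregation that analysis tools are expected to do for deleted/recreated files can be sketched as follows. This is a minimal illustration, not PyDarshan code: the `aggregate_by_name` helper, the tuple layout, and the record IDs and values are all hypothetical, and summing is only meaningful for additive counters such as `DFS_OPENS` (not for timestamps or min/max-style counters).

```python
from collections import defaultdict


def aggregate_by_name(records):
    """Group counter values by file name across distinct record IDs.

    `records` is an iterable of (record_id, file_name, counter, value)
    tuples. This layout is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: defaultdict(int))
    ids = defaultdict(set)
    for record_id, name, counter, value in records:
        totals[name][counter] += value  # valid for additive counters only
        ids[name].add(record_id)
    return {
        name: {"record_ids": sorted(ids[name]), "counters": dict(totals[name])}
        for name in totals
    }


# Hypothetical input: /testFile was deleted and recreated once, so the
# same name shows up under two OID-derived record IDs.
records = [
    (111, "/testFile", "DFS_OPENS", 2),
    (222, "/testFile", "DFS_OPENS", 1),
]
summary = aggregate_by_name(records)
print(summary["/testFile"])
```

Keeping the list of contributing record IDs alongside the totals preserves the link back to the individual per-OID records.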

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DFS     -1      13156018442998895329    DFS_OPENS       2       /testFile       f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
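For readers post-processing this text output, here is a rough sketch of splitting one such line into its columns and separating the pool and container UUIDs from the mount pt column. The `parse_line` helper and `ParsedRecord` type are hypothetical names, and the sketch assumes whitespace-separated columns and file names without embedded spaces.

```python
from typing import NamedTuple


class ParsedRecord(NamedTuple):
    module: str
    rank: int
    record_id: int
    counter: str
    value: float
    name: str
    pool_uuid: str
    cont_uuid: str
    fs_type: str


def parse_line(line: str) -> ParsedRecord:
    """Split one darshan-parser counter line into typed fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    module, rank, rec_id, counter, value, name, mount, fs_type = fields
    if ":" in mount:
        pool, cont = mount.split(":", 1)
    else:
        # e.g. "UNKNOWN" when the UUIDs could not be obtained
        pool = cont = mount
    return ParsedRecord(module, int(rank), int(rec_id), counter,
                        float(value), name, pool, cont, fs_type)


EXAMPLE = (
    "DFS\t-1\t13156018442998895329\tDFS_OPENS\t2\t/testFile\t"
    "f4996f65-9c9a-41c6-ac18-88059a11aeb1:"
    "b445df4d-0f29-462a-9c70-a80bf5a5a0f9\tN/A"
)
rec = parse_line(EXAMPLE)
print(rec.counter, rec.pool_uuid, rec.cont_uuid)
```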

DAOS module:

For each DAOS object, Darshan captures a fixed set of integer/FP counters (see full list in daos-log-format.h), the corresponding DAOS pool/container UUIDs, and the full DAOS OID.
- There are actually 3 distinct DAOS object APIs tracked in the Darshan DAOS module: object (DAOS_OBJ), array (DAOS_ARRAY), and KV (DAOS_KV).
DAOS object records have no name -- when printing these records in darshan-util programs, we just print the OID in string format (i.e., oid_hi.oid_lo, the same approach as DAOS's own utilities).
- Small changes were made to darshan-runtime and darshan-util libraries to allow for records that have no name associated.
DAOS object record IDs are based off of the underlying DAOS OID.
- This makes it trivial to identify which DAOS object records correspond to which DFS file records, as they will have the same Darshan record identifier.
Asynchronous I/O capture is fully supported in Darshan instrumentation wrappers for DAOS interfaces.
The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DAOS    -1      13156018442998895329    DAOS_OBJ_OPENS  1       937047793718163273.416  f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
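Because DFS and DAOS records for the same object share a record identifier, correlating them reduces to a key lookup. The sketch below illustrates that idea along with the `oid_hi.oid_lo` naming; `format_oid`, `match_daos_to_dfs`, and the dictionary layouts are hypothetical, and the unmatched record (ID 42) is made up for the example.

```python
def format_oid(oid_hi: int, oid_lo: int) -> str:
    """Render an OID in the oid_hi.oid_lo string form."""
    return f"{oid_hi}.{oid_lo}"


def match_daos_to_dfs(dfs_records, daos_records):
    """Pair DFS file names with DAOS OIDs that share a record ID.

    dfs_records:  {record_id: file_name}
    daos_records: {record_id: (oid_hi, oid_lo)}
    Both layouts are assumptions made for this sketch.
    """
    shared_ids = dfs_records.keys() & daos_records.keys()
    return {
        rid: (dfs_records[rid], format_oid(*daos_records[rid]))
        for rid in shared_ids
    }


dfs = {13156018442998895329: "/testFile"}
daos = {
    13156018442998895329: (937047793718163273, 416),
    42: (1, 2),  # hypothetical object with no DFS counterpart
}
matches = match_daos_to_dfs(dfs, daos)
print(matches)
```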

Both DFS and DAOS modules integrate with the Darshan heatmap module to generate histograms of I/O activity on each process. Both DFS and DAOS modules have also fully implemented darshan-util and PyDarshan functionality, including support for generating PyDarshan summary reports detailing DFS/DAOS access patterns. PyDarshan tests have been updated to ensure expected behavior when parsing logs containing DFS/DAOS data.

There are a few outstanding items that are not addressed in this PR:

There is no DXT support for the DAOS modules yet. It seems like the right call to limit the scope of changes here and weigh that capability against other development priorities going forward.
DAOS data is integrated into most of the relevant sections in PyDarshan summary reports, but not in the "data access by category" plots. I created an issue to track this: ENH: add new DAOS module data to PyDarshan "data access by category" plots #1015

Replaces #739

* add CFFI shims needed to access DFS record data at the Python level
* adjust `test_main_all_logs_repo_files()` to handle the new `ior` `DFS` log file from Shane--it has a single runtime heatmap for `STDIO`
* `test_module_table()` has been updated with a regression case for Shane's new DFS log file
* add `test_dfs_daos_posix_match()` to ensure counter equivalence between similar `ior..` runs with DAOS vs. POSIX (NOTE: these actually don't look that similar yet--xfailed for now..)

* adjust `test_dfs_daos_posix_match()` to handle the two new POSIX/DAOS "mirror files" from Shane; the `xfail` has been removed and it now passes
* there seems to be some reasonable agreement between the logs, which is good; see the test proper for data columns that do not match or that required special handling for DFS-POSIX equivalence testing
* a few other test suite shims after Shane changed the POSIX/DAOS mirror files

* add DFS support to I/O cost graph in summary reports, with some light unit testing

* add a DFS per-module stats section to the Python summary report, and some initial tests

* simplify the "time" counter handling in `test_dfs_daos_posix_match()` based on reviewer feedback
* `DFS_SLOWEST_RANK` is ignored in the comparisons in `test_dfs_daos_posix_match()` based on reviewer feedback
* the comment about `STAT` counter differences in `test_dfs_daos_posix_match` was removed, based on reviewer feedback
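The kind of DFS-versus-POSIX counter comparison described in these commits could look roughly like the following. This is not the actual `test_dfs_daos_posix_match()` code: the helper names, the prefix-stripping approach, and all counter values are assumptions made for illustration.

```python
def strip_prefix(counters, prefix):
    """Drop a module prefix so DFS_X and POSIX_X compare as X."""
    return {
        name[len(prefix):]: value
        for name, value in counters.items()
        if name.startswith(prefix)
    }


def mismatched_counters(dfs_counters, posix_counters,
                        ignore=frozenset({"SLOWEST_RANK"})):
    """Return shared counter names whose values differ, minus ignored ones."""
    dfs = strip_prefix(dfs_counters, "DFS_")
    posix = strip_prefix(posix_counters, "POSIX_")
    shared = (dfs.keys() & posix.keys()) - ignore
    return sorted(name for name in shared if dfs[name] != posix[name])


# Hypothetical counter values from two "mirror" runs.
dfs_run = {"DFS_OPENS": 2, "DFS_READS": 4, "DFS_SLOWEST_RANK": 3}
posix_run = {"POSIX_OPENS": 2, "POSIX_READS": 5, "POSIX_SLOWEST_RANK": 0}
diff = mismatched_counters(dfs_run, posix_run)
print(diff)
```

Counters that exist in only one module are skipped here; a stricter comparison might report them separately.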

shanedsnyder added this to the 3.4.7 milestone Oct 31, 2024

github-actions bot added the pydarshan label Oct 31, 2024

shanedsnyder closed this Nov 8, 2024

shanedsnyder reopened this Nov 8, 2024

shanedsnyder changed the title ~~WIP: DAOS and DFS modules~~ ENH: DAOS and DFS modules Nov 12, 2024

shanedsnyder changed the title ~~ENH: DAOS and DFS modules~~ [WIP] ENH: DAOS and DFS modules Nov 12, 2024

shanedsnyder force-pushed the snyder/dev-daos-module-3.4 branch from 7c0d9b1 to 00b0a20 Compare April 24, 2025 14:46

Shane Snyder and others added 23 commits May 7, 2025 23:02

initial stubbed out DAOS DFS module

408efdb

first cut at entire dfs runtime/util code

e9f66e2

autoconf/automake support for daos module

77b3b2a

adopt new darshan-core module api

e4f697a

fix up new compile errors/warnings

9908f96

teach automake about daos ld-opts

80e622c

updated comments on missing functionality

8af682c

comment out move/exchange wrappers, need more work

5c854eb

changes to support instrumenting obj_global2local

31c4c43

add example log to temporarily test with

7af5ecd

added new IOR example log files

500b5a2

MAINT: PR 739 revisions

2a0361a

* add DFS support to I/O cost graph in summary reports, with some light unit testing

MAINT: PR 739 revisions

b45a166

* add a DFS per-module stats section to the Python summary report, and some initial tests

rename existing DAOS files to DFS

b6d31bb

fix header guard

002dd29

instrument initial DAOS obj/array routines

1ae1b38

more instrumentation of native daos APIs

956d4de

more includes needed for DAOS header ac checks

a04c95b

use filename rather than OID to generate DFS IDs

e04c2ac

The OID backing a DFS file can change if the file is deleted and recreated.

Revert "use filename rather than OID to generate DFS IDs"

2469c8d

This reverts commit c6e6936.

Shane Snyder and others added 21 commits May 7, 2025 23:02

proper size calculation for array API

e43c5ca

drop DFS_USE_DTX counter

e842b9e

filter out DFS records that do no I/O operations

9245572

cleanup some runtime daos/dfs code

b6741ef

updated darshan-runtime docs for daos

4b25233

updated darshan-util docs for daos

7908b08

small darshan-util tweaks

1b2f3c4

add checks for libuuid to darshan-util configure

9fe0884

updated pydarshan for DFS module

638e624

pydarshan updates for DAOS module

86fbef9

forgot to compare DFS vs DAOS values

8ecb968

drop DAOS records with no real I/O activity

2cc9fd7

updated pydarshan to support DFS/DAOS heatmaps

0b18b11

enforce order for DAOS heatmap figs

12e8782

add more DAOS ops to opcount plots

decda31

generate module overview table for DAOS

eed4d72

also, cleanup file/object terminology in job summary

update DAOS opcount tests

f84513f

add heatmap support for DFS/DAOS modules

b072d44

gracefully handle case with no DFS mount info

e2fbc43

- when using dfs_connect, Darshan has no insight into the pool and container UUIDs being used, so gracefully allow instrumentation to proceed using null UUIDs for those values

add async support to DFS/DAOS modules

adfc568

daos not enabled by default fix in docs

b3bf403

shanedsnyder force-pushed the snyder/dev-daos-module-3.4 branch from 4fda11d to b3bf403 Compare May 7, 2025 23:03

Shane Snyder and others added 6 commits May 8, 2025 00:10

updated util docs

47223ca

updated Darshan log

a3277c6

fix PyDarshan DAOS struct def

2bca3eb

fix op count label

4011855

updated pytest

62dc6e3

bug fix in pydarshan tests

4ce04a3

shanedsnyder changed the title ~~[WIP] ENH: DAOS and DFS modules~~ ENH: DAOS and DFS modules May 8, 2025

shanedsnyder merged commit f2890c5 into main May 8, 2025
20 checks passed

3 participants