DLIO and Darshan #1023

I was trying out Darshan with DLIO after reading the paper [I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey](https://arxiv.org/abs/2404.10386).

I noticed that Darshan (in DXT mode) captures the read operations but not the writes, even though my DLIO workload generates large checkpoints:
```bash
> du -h checkpoints/resnet50_my_a100_pytorch 

36G	checkpoints/resnet50_my_a100_pytorch/global_epoch9_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch7_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch1_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch3_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch5_step3
178G	checkpoints/resnet50_my_a100_pytorch
```

The DLIO log also shows that the checkpoints are generated (all ranks read and write):
```bash
[OUTPUT] 2025-05-02T10:11:04.936647 Starting data generation
[OUTPUT] 2025-05-02T10:11:06.154185 Generation done
[OUTPUT] 2025-05-02T10:11:06.217784 Total number of parameters in the model: 3172149248
[OUTPUT] 2025-05-02T10:11:06.307744 Model size: 0.0000 GB
[OUTPUT] 2025-05-02T10:11:06.307949 Optimizer state size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308028 Total checkpoint size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308140 Max steps per epoch: 2 = 100 * 1024 / 400 / 96 (samples per file * num files / batch size / comm size)
[OUTPUT] 2025-05-02T10:11:06.346457 Starting epoch 1: 2 steps expected
[OUTPUT] 2025-05-02T10:11:06.346818 Starting block 1
...
[OUTPUT] 2025-05-02T10:12:19.556005 Starting saving checkpoint 1 after total step 2 for epoch 7
[OUTPUT] 2025-05-02T10:12:19.591650 Saved model checkpoint in 0.0011 seconds
[OUTPUT] 2025-05-02T10:12:20.432879 Saved optimizer checkpoint in 0.8411 seconds
[OUTPUT] 2025-05-02T10:12:20.433231 Finished saving checkpoint 1 for epoch 7 in 0.8772 s; Throughput: 40.4257 GB/s
[OUTPUT] 2025-05-02T10:12:20.441221 Ending epoch 7 - 2 steps completed in 10.49 s
[OUTPUT] 2025-05-02T10:12:21.130292 Starting epoch 8: 2 steps expected
[OUTPUT] 2025-05-02T10:12:21.130710 Starting block 1
[OUTPUT] 2025-05-02T10:12:31.063011 Ending epoch 8 - 2 steps completed in 9.93 s
[OUTPUT] 2025-05-02T10:12:31.082444 Starting epoch 9: 2 steps expected
[OUTPUT] 2025-05-02T10:12:31.082824 Starting block 1
[OUTPUT] 2025-05-02T10:12:40.363424 Ending block 1 - 2 steps completed in 9.28 s
[OUTPUT] 2025-05-02T10:12:40.365390 Epoch 9 - Block 1 [Training] Accelerator Utilization [AU] (%): 64.6117
[OUTPUT] 2025-05-02T10:12:40.365474 Epoch 9 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2025-05-02T10:12:40.365556 Epoch 9 - Block 1 [Training] Computation time per step (second): 0.4351+/-0.0000 (set value: {'mean': 0.435})
[OUTPUT] 2025-05-02T10:12:40.365892 Starting saving checkpoint 1 after total step 2 for epoch 9
[OUTPUT] 2025-05-02T10:12:40.398263 Saved model checkpoint in 0.0018 seconds
[OUTPUT] 2025-05-02T10:12:41.964534 Saved optimizer checkpoint in 1.5661 seconds
[OUTPUT] 2025-05-02T10:12:41.964792 Finished saving checkpoint 1 for epoch 9 in 1.5989 s; Throughput: 22.1793 GB/s
[OUTPUT] 2025-05-02T10:12:41.972896 Ending epoch 9 - 2 steps completed in 10.89 s
[OUTPUT] 2025-05-02T10:12:42.022377 Starting epoch 10: 2 steps expected
[OUTPUT] 2025-05-02T10:12:42.022745 Starting block 1
[OUTPUT] 2025-05-02T10:12:51.297347 Ending epoch 10 - 2 steps completed in 9.27 s
[OUTPUT] 2025-05-02T10:12:51.327897 Saved outputs in  <hidden>/results/darshan/hydra_log/resnet50/2025-05-02-10-10-43
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 96 
[METRIC] Training Accelerator Utilization [AU] (%): 23.1229 (24.1460)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Checkpoint save duration (seconds): 0.9046 (0.3615)
[METRIC] Checkpoint save I/O Throughput (GB/second): 44.2668 (13.3415)
[METRIC] ==========================================================
```
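As a sanity check on the log above, the following minimal sketch recomputes the steps per epoch, the reported checkpoint throughput, and the total write volume one would expect a profiler to see. All numbers are copied from the log and the `du` listing; the function names are hypothetical, and the count of five checkpoints is taken from the five directories listed by `du`. Whether DLIO's "GB" and the `G` printed by `du -h` use the same unit is not checked here.

```python
# Values copied from the DLIO log and the du listing above.
SAMPLES_PER_FILE = 100
NUM_FILES = 1024
BATCH_SIZE = 400
COMM_SIZE = 96
CHECKPOINT_SIZE_GB = 35.4625   # "Total checkpoint size" in the log
NUM_CHECKPOINTS = 5            # directories for epochs 1, 3, 5, 7, 9


def steps_per_epoch(samples_per_file, num_files, batch_size, comm_size):
    """Whole steps per rank and epoch; the remainder is dropped."""
    return (samples_per_file * num_files) // (batch_size * comm_size)


def throughput_gb_s(size_gb, seconds):
    """Checkpoint throughput as size divided by save duration."""
    return size_gb / seconds


steps = steps_per_epoch(SAMPLES_PER_FILE, NUM_FILES, BATCH_SIZE, COMM_SIZE)
epoch9_throughput = throughput_gb_s(CHECKPOINT_SIZE_GB, 1.5989)
expected_written_gb = NUM_CHECKPOINTS * CHECKPOINT_SIZE_GB

print(f"steps per epoch:        {steps}")
print(f"epoch 9 throughput:     {epoch9_throughput:.4f} GB/s")
print(f"expected write volume:  {expected_written_gb:.4f} GB")
```

The recomputed values agree with the log (2 steps per epoch, about 22.18 GB/s for the epoch 9 checkpoint), and the expected write volume of roughly 177 GB is in line with the 178G reported by `du`. This is the amount of write traffic that appears to be absent from the Darshan log.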

With the [metric proxy](https://github.com/A-Tarraf/proxy_v2) and a similar run, I could see the writes when capturing the I/O with a modified strace:

![Image](https://github.com/user-attachments/assets/e7fb973c-6c97-4d06-be2a-ca6180209da0)
![Image](https://github.com/user-attachments/assets/baab652c-3f1f-4617-b515-9f4f2821f3ed)


I can share the .darshan file if needed (GitHub does not allow attaching it). The profiler also confirms this:
![Image](https://github.com/user-attachments/assets/034254b1-3657-4701-af22-66f101330ac7)
![Image](https://github.com/user-attachments/assets/188fbe33-8d06-4d71-911c-5870a93361c8)

It seems that Darshan is missing the writes.
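One way to cross-check this independently of the DXT traces is to sum the byte counters in the text output of `darshan-parser`. The sketch below assumes the usual tab-separated record layout (module, rank, record id, counter, value, file name, mount point, file system type) with `#` comment lines. The sample records and paths are made up for illustration and are not taken from my log, and `summarize_bytes` is a hypothetical helper name.

```python
def summarize_bytes(lines, counters=("POSIX_BYTES_READ",
                                     "POSIX_BYTES_WRITTEN",
                                     "STDIO_BYTES_WRITTEN")):
    """Sum the selected counters over all records in darshan-parser text output."""
    totals = {name: 0 for name in counters}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        counter, value = fields[3], fields[4]
        if counter in totals:
            totals[counter] += int(value)
    return totals


# Hypothetical sample records, only to show the expected layout.
sample = """\
#<module>\t<rank>\t<record id>\t<counter>\t<value>\t<file name>\t<mount pt>\t<fs type>
POSIX\t0\t111\tPOSIX_OPENS\t1\t/data/train/file_0.npz\t/\text4
POSIX\t0\t111\tPOSIX_BYTES_READ\t4096\t/data/train/file_0.npz\t/\text4
POSIX\t1\t222\tPOSIX_BYTES_READ\t8192\t/data/train/file_1.npz\t/\text4
POSIX\t0\t111\tPOSIX_BYTES_WRITTEN\t0\t/data/train/file_0.npz\t/\text4
""".splitlines()

totals = summarize_bytes(sample)
for name, value in totals.items():
    print(f"{name}: {value}")
```

On a real log, the input would be the saved output of `darshan-parser <logfile>`, read line by line. If the write counters sum to zero there as well, the writes are missing from the log as a whole and not only from the DXT view.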
