
DLIO and Darshan #1023

@A-Tarraf

I was trying out Darshan with DLIO after reading the paper "I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey".

I noticed that Darshan (in DXT mode) captures read operations but no writes, even though my DLIO workload generates large checkpoints:

> du -h checkpoints/resnet50_my_a100_pytorch 

36G	checkpoints/resnet50_my_a100_pytorch/global_epoch9_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch7_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch1_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch3_step3
36G	checkpoints/resnet50_my_a100_pytorch/global_epoch5_step3
178G	checkpoints/resnet50_my_a100_pytorch

Also, the DLIO log shows that the checkpoints are generated (all ranks read and write):

[OUTPUT] 2025-05-02T10:11:04.936647 Starting data generation
[OUTPUT] 2025-05-02T10:11:06.154185 Generation done
[OUTPUT] 2025-05-02T10:11:06.217784 Total number of parameters in the model: 3172149248
[OUTPUT] 2025-05-02T10:11:06.307744 Model size: 0.0000 GB
[OUTPUT] 2025-05-02T10:11:06.307949 Optimizer state size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308028 Total checkpoint size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308140 Max steps per epoch: 2 = 100 * 1024 / 400 / 96 (samples per file * num files / batch size / comm size)
[OUTPUT] 2025-05-02T10:11:06.346457 Starting epoch 1: 2 steps expected
[OUTPUT] 2025-05-02T10:11:06.346818 Starting block 1
...
[OUTPUT] 2025-05-02T10:12:19.556005 Starting saving checkpoint 1 after total step 2 for epoch 7
[OUTPUT] 2025-05-02T10:12:19.591650 Saved model checkpoint in 0.0011 seconds
[OUTPUT] 2025-05-02T10:12:20.432879 Saved optimizer checkpoint in 0.8411 seconds
[OUTPUT] 2025-05-02T10:12:20.433231 Finished saving checkpoint 1 for epoch 7 in 0.8772 s; Throughput: 40.4257 GB/s
[OUTPUT] 2025-05-02T10:12:20.441221 Ending epoch 7 - 2 steps completed in 10.49 s
[OUTPUT] 2025-05-02T10:12:21.130292 Starting epoch 8: 2 steps expected
[OUTPUT] 2025-05-02T10:12:21.130710 Starting block 1
[OUTPUT] 2025-05-02T10:12:31.063011 Ending epoch 8 - 2 steps completed in 9.93 s
[OUTPUT] 2025-05-02T10:12:31.082444 Starting epoch 9: 2 steps expected
[OUTPUT] 2025-05-02T10:12:31.082824 Starting block 1
[OUTPUT] 2025-05-02T10:12:40.363424 Ending block 1 - 2 steps completed in 9.28 s
[OUTPUT] 2025-05-02T10:12:40.365390 Epoch 9 - Block 1 [Training] Accelerator Utilization [AU] (%): 64.6117
[OUTPUT] 2025-05-02T10:12:40.365474 Epoch 9 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2025-05-02T10:12:40.365556 Epoch 9 - Block 1 [Training] Computation time per step (second): 0.4351+/-0.0000 (set value: {'mean': 0.435})
[OUTPUT] 2025-05-02T10:12:40.365892 Starting saving checkpoint 1 after total step 2 for epoch 9
[OUTPUT] 2025-05-02T10:12:40.398263 Saved model checkpoint in 0.0018 seconds
[OUTPUT] 2025-05-02T10:12:41.964534 Saved optimizer checkpoint in 1.5661 seconds
[OUTPUT] 2025-05-02T10:12:41.964792 Finished saving checkpoint 1 for epoch 9 in 1.5989 s; Throughput: 22.1793 GB/s
[OUTPUT] 2025-05-02T10:12:41.972896 Ending epoch 9 - 2 steps completed in 10.89 s
[OUTPUT] 2025-05-02T10:12:42.022377 Starting epoch 10: 2 steps expected
[OUTPUT] 2025-05-02T10:12:42.022745 Starting block 1
[OUTPUT] 2025-05-02T10:12:51.297347 Ending epoch 10 - 2 steps completed in 9.27 s
[OUTPUT] 2025-05-02T10:12:51.327897 Saved outputs in  <hidden>/results/darshan/hydra_log/resnet50/2025-05-02-10-10-43
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 96 
[METRIC] Training Accelerator Utilization [AU] (%): 23.1229 (24.1460)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Checkpoint save duration (seconds): 0.9046 (0.3615)
[METRIC] Checkpoint save I/O Throughput (GB/second): 44.2668 (13.3415)
[METRIC] ==========================================================
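As a sanity check on the log above, here is a minimal sketch that recomputes two of the reported numbers from the values in the log: the steps per epoch and the per-checkpoint throughput. The function names are hypothetical, and rounding the step count down is my assumption (the log only shows that 100 * 1024 / 400 / 96 is reported as 2). The throughput is assumed to be total checkpoint size divided by save duration, which matches the logged values up to rounding.

```python
import math

# Values copied from the DLIO log above.
CHECKPOINT_SIZE_GB = 35.4625


def max_steps_per_epoch(samples_per_file, num_files, batch_size, comm_size):
    # Assumption: DLIO rounds the quotient down (2.67 is reported as 2).
    return math.floor(samples_per_file * num_files / batch_size / comm_size)


def checkpoint_throughput_gbps(size_gb, duration_s):
    # Assumption: throughput = total checkpoint size / save duration.
    return size_gb / duration_s


steps = max_steps_per_epoch(100, 1024, 400, 96)
epoch7 = checkpoint_throughput_gbps(CHECKPOINT_SIZE_GB, 0.8772)
epoch9 = checkpoint_throughput_gbps(CHECKPOINT_SIZE_GB, 1.5989)

print(f"steps per epoch: {steps}")
print(f"epoch 7 checkpoint throughput: {epoch7:.2f} GB/s")
print(f"epoch 9 checkpoint throughput: {epoch9:.2f} GB/s")
```

If these assumptions hold, DLIO itself accounts for roughly 35 GB written per checkpoint, which is the write volume that should show up in the Darshan trace.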

With the metric proxy and a similar run, I could see the writes by capturing data using a modified strace:

Image
Image

I can share the .darshan file if needed (GitHub does not allow attaching it here). The profiler also confirms this:
Image
Image

So it seems that Darshan is missing the writes.
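For anyone who wants to reproduce the check on their own log, here is a minimal sketch of how the read and write segments in a DXT trace could be counted with PyDarshan. It assumes that `darshan.DarshanReport`, `mod_read_all_dxt_records`, and `records[...].to_df()` behave as described in the PyDarshan documentation, and that each DXT record exposes `read_segments` and `write_segments`. The helper names and the log path `my_run.darshan` are hypothetical.

```python
def load_dxt_records(log_path, module="DXT_POSIX"):
    """Load DXT records from a Darshan log (requires the pydarshan package)."""
    import darshan  # imported lazily so the counting helper works without it

    report = darshan.DarshanReport(log_path, read_all=False)
    report.mod_read_all_dxt_records(module)
    # Assumption: to_df() returns a list of dicts, one per (file, rank) record.
    return report.records[module].to_df()


def count_dxt_segments(records):
    """Return (total read segments, total write segments) over all records."""
    reads = sum(len(rec["read_segments"]) for rec in records)
    writes = sum(len(rec["write_segments"]) for rec in records)
    return reads, writes


# Usage with a real log (path is a placeholder):
#   reads, writes = count_dxt_segments(load_dxt_records("my_run.darshan"))
#   print(f"read segments: {reads}, write segments: {writes}")

# Synthetic records illustrating the symptom: reads traced, no writes.
example = [
    {"rank": 0, "read_segments": [(0, 4096), (4096, 4096)], "write_segments": []},
    {"rank": 1, "read_segments": [(0, 4096)], "write_segments": []},
]
example_reads, example_writes = count_dxt_segments(example)
print(f"read segments: {example_reads}, write segments: {example_writes}")
```

A write count of zero for a run that produced 178G of checkpoints would match what I am seeing.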
