I was trying out Darshan with DLIO after reading the paper "I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey".
I noticed that Darshan (in DXT mode) captures read operations but not writes, even though my DLIO workload generates large checkpoints:
> du -h checkpoints/resnet50_my_a100_pytorch
36G checkpoints/resnet50_my_a100_pytorch/global_epoch9_step3
36G checkpoints/resnet50_my_a100_pytorch/global_epoch7_step3
36G checkpoints/resnet50_my_a100_pytorch/global_epoch1_step3
36G checkpoints/resnet50_my_a100_pytorch/global_epoch3_step3
36G checkpoints/resnet50_my_a100_pytorch/global_epoch5_step3
178G checkpoints/resnet50_my_a100_pytorch
Also, the DLIO output shows that checkpoints are generated (all ranks read and write):
[OUTPUT] 2025-05-02T10:11:04.936647 Starting data generation
[OUTPUT] 2025-05-02T10:11:06.154185 Generation done
[OUTPUT] 2025-05-02T10:11:06.217784 Total number of parameters in the model: 3172149248
[OUTPUT] 2025-05-02T10:11:06.307744 Model size: 0.0000 GB
[OUTPUT] 2025-05-02T10:11:06.307949 Optimizer state size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308028 Total checkpoint size: 35.4625 GB
[OUTPUT] 2025-05-02T10:11:06.308140 Max steps per epoch: 2 = 100 * 1024 / 400 / 96 (samples per file * num files / batch size / comm size)
[OUTPUT] 2025-05-02T10:11:06.346457 Starting epoch 1: 2 steps expected
[OUTPUT] 2025-05-02T10:11:06.346818 Starting block 1
...
[OUTPUT] 2025-05-02T10:12:19.556005 Starting saving checkpoint 1 after total step 2 for epoch 7
[OUTPUT] 2025-05-02T10:12:19.591650 Saved model checkpoint in 0.0011 seconds
[OUTPUT] 2025-05-02T10:12:20.432879 Saved optimizer checkpoint in 0.8411 seconds
[OUTPUT] 2025-05-02T10:12:20.433231 Finished saving checkpoint 1 for epoch 7 in 0.8772 s; Throughput: 40.4257 GB/s
[OUTPUT] 2025-05-02T10:12:20.441221 Ending epoch 7 - 2 steps completed in 10.49 s
[OUTPUT] 2025-05-02T10:12:21.130292 Starting epoch 8: 2 steps expected
[OUTPUT] 2025-05-02T10:12:21.130710 Starting block 1
[OUTPUT] 2025-05-02T10:12:31.063011 Ending epoch 8 - 2 steps completed in 9.93 s
[OUTPUT] 2025-05-02T10:12:31.082444 Starting epoch 9: 2 steps expected
[OUTPUT] 2025-05-02T10:12:31.082824 Starting block 1
[OUTPUT] 2025-05-02T10:12:40.363424 Ending block 1 - 2 steps completed in 9.28 s
[OUTPUT] 2025-05-02T10:12:40.365390 Epoch 9 - Block 1 [Training] Accelerator Utilization [AU] (%): 64.6117
[OUTPUT] 2025-05-02T10:12:40.365474 Epoch 9 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2025-05-02T10:12:40.365556 Epoch 9 - Block 1 [Training] Computation time per step (second): 0.4351+/-0.0000 (set value: {'mean': 0.435})
[OUTPUT] 2025-05-02T10:12:40.365892 Starting saving checkpoint 1 after total step 2 for epoch 9
[OUTPUT] 2025-05-02T10:12:40.398263 Saved model checkpoint in 0.0018 seconds
[OUTPUT] 2025-05-02T10:12:41.964534 Saved optimizer checkpoint in 1.5661 seconds
[OUTPUT] 2025-05-02T10:12:41.964792 Finished saving checkpoint 1 for epoch 9 in 1.5989 s; Throughput: 22.1793 GB/s
[OUTPUT] 2025-05-02T10:12:41.972896 Ending epoch 9 - 2 steps completed in 10.89 s
[OUTPUT] 2025-05-02T10:12:42.022377 Starting epoch 10: 2 steps expected
[OUTPUT] 2025-05-02T10:12:42.022745 Starting block 1
[OUTPUT] 2025-05-02T10:12:51.297347 Ending epoch 10 - 2 steps completed in 9.27 s
[OUTPUT] 2025-05-02T10:12:51.327897 Saved outputs in <hidden>/results/darshan/hydra_log/resnet50/2025-05-02-10-10-43
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 96
[METRIC] Training Accelerator Utilization [AU] (%): 23.1229 (24.1460)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] Checkpoint save duration (seconds): 0.9046 (0.3615)
[METRIC] Checkpoint save I/O Throughput (GB/second): 44.2668 (13.3415)
[METRIC] ==========================================================
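As a rough cross-check of the numbers above, the following sketch recomputes the steps per epoch, the checkpoint throughput, and the total checkpoint volume that a write trace would be expected to account for. All values are copied from the log and the `du` output. The function names are hypothetical, and the floor division and the size-over-duration formula are assumptions that happen to match the logged values, not DLIO's confirmed implementation.

```python
# Values copied from the DLIO log and the du listing above.
SAMPLES_PER_FILE = 100
NUM_FILES = 1024
BATCH_SIZE = 400
COMM_SIZE = 96
CHECKPOINT_SIZE_GB = 35.4625   # "Total checkpoint size" in the log
NUM_CHECKPOINTS = 5            # epochs 1, 3, 5, 7, 9 in the du listing


def steps_per_epoch(samples_per_file, num_files, batch_size, comm_size):
    """Assumed floor division, mirroring '2 = 100 * 1024 / 400 / 96'."""
    return samples_per_file * num_files // batch_size // comm_size


def checkpoint_throughput_gb_s(size_gb, duration_s):
    """Assumed definition: checkpoint size divided by save duration."""
    return size_gb / duration_s


def expected_write_volume_gb(size_gb, num_checkpoints):
    """Total checkpoint data written over the whole run."""
    return size_gb * num_checkpoints


if __name__ == "__main__":
    print("steps per epoch:",
          steps_per_epoch(SAMPLES_PER_FILE, NUM_FILES, BATCH_SIZE, COMM_SIZE))
    # Save durations taken from the epoch 7 and epoch 9 log lines.
    for epoch, duration in ((7, 0.8772), (9, 1.5989)):
        rate = checkpoint_throughput_gb_s(CHECKPOINT_SIZE_GB, duration)
        print(f"epoch {epoch} checkpoint throughput: {rate:.4f} GB/s")
    total = expected_write_volume_gb(CHECKPOINT_SIZE_GB, NUM_CHECKPOINTS)
    print(f"expected checkpoint write volume: {total:.4f} GB")
```

Under these assumptions the five checkpoints add up to about 177 GB, which is consistent with the 178G total reported by `du`, so roughly that much written data would be expected to appear in a complete trace.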
With the metric proxy and a similar run, I could see the writes by capturing data using a modified strace:
I can share the .darshan file if needed (GitHub does not allow attaching it). The profiler also confirms this info:


It seems that Darshan is missing the writes.