I am trying to run MLCommons DLRM on 2 nodes with H100 GPUs. How do i enable profiling on this to get information on time taken in compute, communication and storage?
Mostly interested in getting a clear understanding of how much time is taken by each component of our infrastructure