A practical monitoring tool that reports how NVIDIA GPU resources are actually exercised during production jobs.
The analysis tool depends on uv to create a Python environment with the necessary dependencies. If uv is not installed, run

```
make install-uv
```

To create the gssr-record executable, run

```
make
```

A simple example to test the setup is to record a few seconds of no activity.
Here we first run the sleep command for 30 seconds and store the recorded data in a directory called gr-sleep-test. The PDF report is then generated from the data in that directory.
```
./gssr-record -o gr-sleep-test sleep 30
./gssr-analyze gr-sleep-test -o gr-sleep-test-report.pdf
```

```
Usage: gssr-record [OPTIONS] <cmd> [args...]

Run cmd and record GPU metrics. Results can be given to the
GPU saturation scorer (GSSR) to produce a report for CSCS project proposals.

  -h | --help       Display this help message.
  --version         Show version information.
  -o <directory>    Create directory and write results there.
```
gssr-record depends on the NVIDIA DCGM library. When running in a
container at CSCS you will need the Container Engine annotation in
your container EDF file:
```
[annotations]
com.hooks.dcgm.enabled = "true"
```

The report is then generated with gssr-analyze:

```
usage: gssr-analyze [-h] [-o OUTPUT] [directory ...]

Generate a report from GPU metrics collected with gssr-record.

positional arguments:
  directory             Top-level directories created by gssr-record

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output PDF filename (default: gssr-report.pdf)
```

Notes:
- Only one rank per node will record metrics for all local GPUs
- Overlapping Slurm jobs will record the same data for the same GPUs
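For context, a minimal EDF file with the DCGM annotation might look like the sketch below. Only the `[annotations]` section is taken from this document; the `image` line is a placeholder, and real EDF files may carry additional fields (mounts, working directory, etc.) as described in the CSCS Container Engine documentation.

```toml
# Sketch of a container EDF file (TOML). The image reference is a
# placeholder -- substitute the image you actually run your job with.
image = "example.registry#myorg/myimage:latest"

# Enable the DCGM hook so gssr-record can read GPU metrics inside
# the container.
[annotations]
com.hooks.dcgm.enabled = "true"
```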