Official documentation for DCGM-Exporter can be found on [docs.nvidia.com](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/).

### Quickstart
To gather metrics on a GPU node, simply start the `dcgm-exporter` container:

```shell
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
curl localhost:9400/metrics

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
```

### Quickstart on Kubernetes

Note: Consider using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) rather than deploying DCGM-Exporter directly.

Ensure you have already set up your cluster with the [default runtime as NVIDIA](https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup).
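
For reference, a sketch of one common way to configure Docker's default runtime (per the nvidia-container-runtime setup linked above; the runtime path is the usual default and may differ on your system):

```shell
# Make "nvidia" the default Docker runtime (assumed default runtime path; adjust if needed)
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```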
The recommended way to install DCGM-Exporter is to use the Helm chart:
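
For example, a sketch of the Helm commands, assuming the chart is published under the project's `gpu-helm-charts` repository (verify the repository URL and chart name against the official instructions):

```shell
# Add the DCGM-Exporter Helm repository and install the chart (assumed repo URL and chart name)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
```
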
To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/).
`dcgm-exporter` is deployed as part of the GPU Operator. To get started with integrating with Prometheus, check the Operator [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#gpu-telemetry).
### TLS and Basic Auth
The exporter supports TLS and basic auth using [exporter-toolkit](https://github.com/prometheus/exporter-toolkit). To use TLS and/or basic auth, users need to use the `--web-config-file` CLI flag as follows:

```shell
dcgm-exporter --web-config-file=web-config.yaml
```
A sample `web-config.yaml` file can be fetched from the [exporter-toolkit repository](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-config.yml). The reference documentation for the `web-config.yaml` file can be consulted in the [docs](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md).
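
For illustration, a minimal `web-config.yaml` enabling TLS and one basic-auth user could look like the sketch below; the certificate paths and the bcrypt hash are placeholders, and the authoritative schema is the exporter-toolkit reference linked above. Once the file is in place, start the exporter with the `--web-config-file` flag shown earlier.

```shell
# Write a minimal web-config.yaml (placeholder paths and password hash; see the exporter-toolkit docs for the full schema)
cat > web-config.yaml <<'EOF'
tls_server_config:
  cert_file: /etc/dcgm-exporter/tls/server.crt
  key_file: /etc/dcgm-exporter/tls/server.key
basic_auth_users:
  # user name mapped to a bcrypt hash of the password, e.g. generated with: htpasswd -nBC 10 "" | tr -d ':'
  admin: $2y$10$REPLACE_WITH_A_BCRYPT_HASH
EOF
```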
### How to include HPC jobs in metric labels
DCGM-Exporter can include High-Performance Computing (HPC) job information in its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs.
#### File Conventions
These mapping files follow a specific format:
102
+
103
+
* Each file is named after a unique GPU ID (e.g., 0, 1, 2, etc.).
* Each line in the file contains JOB IDs that run on the corresponding GPU.
#### Enabling HPC Job Mapping on DCGM-Exporter
To enable GPU-to-job mapping on the DCGM-Exporter side, users must run DCGM-Exporter with the `--hpc-job-mapping-dir` command-line parameter, pointing to the directory where the HPC cluster creates the job mapping files. Alternatively, users can set the `DCGM_HPC_JOB_MAPPING_DIR` environment variable to achieve the same result.
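
As an illustration (the directory path and job IDs below are placeholders chosen for the example, not values required by DCGM-Exporter), the mapping directory and the exporter invocation could look like this:

```shell
# One file per GPU, named after the GPU ID; each line holds a job ID running on that GPU
mkdir -p /var/run/dcgm-job-maps
echo "job-1042" >  /var/run/dcgm-job-maps/0
echo "job-1043" >> /var/run/dcgm-job-maps/0
echo "job-2001" >  /var/run/dcgm-job-maps/1

# Point the exporter at the mapping directory via the flag ...
dcgm-exporter --hpc-job-mapping-dir=/var/run/dcgm-job-maps

# ... or via the equivalent environment variable
DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm-job-maps dcgm-exporter
```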
### Building from Source
In order to build dcgm-exporter ensure you have the following:
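
* [Golang](https://golang.org/) installed (see the repository's `go.mod` for the required version)
* [DCGM](https://developer.nvidia.com/dcgm) installed

With those in place, a rough sketch of building and smoke-testing the exporter, assuming the Makefile's `binary` and `install` targets (check the Makefile in your checkout):

```shell
# Build the dcgm-exporter binary and install it (assumed Makefile targets)
make binary
sudo make install

# Run the exporter and scrape the metrics endpoint
dcgm-exporter &
curl localhost:9400/metrics
```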
### Changing Metrics

With `dcgm-exporter` you can configure which fields are collected by specifying a custom CSV file.

You will find the default CSV file under `etc/default-counters.csv` in the repository, which is copied to your system or container at `/etc/dcgm-exporter/default-counters.csv`.
The layout and format of this file are as follows:
```
# Format
# If line starts with a '#' it is considered a comment
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
```
A custom CSV file can be specified using the `-f` option or `--collectors` as follows:

```shell
dcgm-exporter -f /tmp/custom-collectors.csv
```
Notes:

* Always make sure your entries have 2 commas (',')
* The complete list of counters that can be collected can be found in the DCGM API reference manual: <https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html>
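
For illustration, a small custom collectors file could be created and used like this (the fields below are examples; any field ID from the DCGM reference above can be listed):

```shell
# Write an example custom collectors file (each entry: DCGM field, Prometheus metric type, help message)
cat > /tmp/custom-collectors.csv <<'EOF'
# DCGM FIELD,            Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK,    gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
EOF

dcgm-exporter -f /tmp/custom-collectors.csv
```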
### What about a Grafana Dashboard?
You can find the official NVIDIA DCGM-Exporter dashboard here: <https://grafana.com/grafana/dashboards/12239>
You will also find the `json` file in this repo under `grafana/dcgm-exporter-dashboard.json`.
Pull requests are accepted!
### Building the containers
This project uses [docker buildx](https://docs.docker.com/buildx/working-with-buildx/) for multi-arch image creation. Follow the instructions on that page to get a working builder instance for creating these containers. Some other useful build options follow.
Builds local images based on the machine architecture and makes them available in 'docker images'

```
make local
```
Build the ubuntu image and export to 'docker images'

```shell
make ubuntu22.04 PLATFORMS=linux/amd64 OUTPUT=type=docker
```
Build and push the images to some other 'private_registry'
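
A rough sketch, assuming the Makefile exposes a `REGISTRY` variable and a `push` target (verify the actual variable and target names against the Makefile):

```shell
# Hypothetical invocation: REGISTRY and the push target are assumptions, not documented options
make REGISTRY=private_registry push
```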