HEALTH_CHECKS.md (+34 -4)
@@ -6,6 +6,8 @@ Here is a breakdown of the existing health checks:
6
6
- Description: Host-to-device connection speeds, one measurement per GPU. Codebase in tag [v12.4.1](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest)
7
7
- Outputs: Pass/fail results based on PCIe bandwidth thresholds.
8
8
- Implementation: Compares bandwidth results to a threshold (e.g., 8 GB/s). If the measured bandwidth falls below the threshold, it triggers a failure.
9
+
- We recommend setting the threshold at 25% or less of the expected peak PCIe bandwidth, which corresponds to a link degraded from 16 lanes to 4 lanes. For example, PCIe Gen4 x16 has a reported peak bandwidth of 63 GB/s; 25% of that is 15.75 GB/s, which corresponds to PCIe Gen4 x4.
10
+
- On a healthy link, the measured bandwidth is expected to be at least 80% of the peak bandwidth for the PCIe generation in use.
9
11
2. **GPU Memory Check (remapped)**
10
12
- Description: Information from nvidia-smi regarding GPU memory remapped rows.
11
13
- Outputs: Reports the state of GPU memory (normal/faulty).
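The threshold arithmetic described for the PCIe bandwidth check can be sketched in a few lines of Python. This is only an illustration of the 25% and 80% figures given above, not Autopilot's implementation: the function names, the `BELOW_EXPECTED` state, and the default fractions as parameters are all hypothetical.

```python
def pcie_thresholds(peak_gbps, degraded_fraction=0.25, healthy_fraction=0.80):
    """Return (failure_threshold, healthy_floor) in GB/s for a given peak.

    The fractions mirror the guidance in the text: fail at 25% of peak
    (x16 degraded to x4), expect at least 80% of peak when healthy.
    """
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return peak_gbps * degraded_fraction, peak_gbps * healthy_fraction


def classify_bandwidth(measured_gbps, peak_gbps):
    """Classify one measurement; the state names are illustrative only."""
    failure_threshold, healthy_floor = pcie_thresholds(peak_gbps)
    if measured_gbps < failure_threshold:
        return "FAIL"            # below the configured threshold
    if measured_gbps < healthy_floor:
        return "BELOW_EXPECTED"  # hypothetical state: passes, but under 80% of peak
    return "OK"


# Example with the PCIe Gen4 x16 figure quoted in the text (63 GB/s).
failure_threshold, healthy_floor = pcie_thresholds(63.0)
print(failure_threshold)               # 15.75
print(classify_bandwidth(12.0, 63.0))  # FAIL
```

The failure threshold for a 63 GB/s peak comes out at 15.75 GB/s, matching the Gen4 x4 figure quoted in item 1.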
@@ -35,14 +37,40 @@ These checks are configured to run periodically (e.g., hourly), and results are
35
37
36
38
## Deep Diagnostics and Node Labeling
37
39
38
-
Autopilot runs health checks periodically on GPU nodes, and if any of the health checks returns an error, the node is labeled with `autopilot.ibm.com/gpuhealth: ERR`. Otherwise, the label is set as `PASS`.
40
+
Autopilot's periodic health checks label the worker nodes according to the result obtained.
41
+
Lightweight and invasive health checks may use different labeling systems.
39
42
40
-
Also, more extensive tests, namely DCGM diagnostics level 3, are also executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
43
+
If the health checks, lightweight or invasive, report success, the node is labeled with:
44
+
45
+
```yaml
46
+
autopilot.ibm.com/gpuhealth: PASS
47
+
```
48
+
49
+
When the lightweight health checks report an issue, the node is labeled with:
50
+
51
+
```yaml
52
+
autopilot.ibm.com/gpuhealth: WARN
53
+
```
54
+
55
+
### Invasive health checks
56
+
57
+
The invasive health check, DCGM diagnostics level 3, is executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal GPU problems that can be found only by running the level 3 DCGM diagnostic.
41
58
This type of diagnostic can help decide whether the worker node should be used for running workloads. To facilitate this task, Autopilot will label nodes with the key `autopilot.ibm.com/dcgm.level.3`.
42
59
43
-
If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format:
60
+
If a fatal error is found, the `gpuhealth` label is updated to `EVICT`.
Only fatal errors should produce an `EVICT` label. We follow [NVIDIA recommendations](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#id3), although it is possible to customize the list of tests through the Helm chart. The default values are `[PCIe,NVLink,ECC,GPU Memory]`.
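The labeling rules described in this section can be summarized with a small Python sketch. It is an assumption-laden illustration, not Autopilot's code: the function name `gpuhealth_label`, its parameters, and the use of `None` for "leave the label unchanged" are hypothetical. The text shown here does not say what label a non-fatal level 3 failure produces, so the sketch deliberately leaves that case undecided.

```python
# Default list of fatal tests, as given in the text (customizable via the Helm chart).
FATAL_TESTS = ("PCIe", "NVLink", "ECC", "GPU Memory")


def gpuhealth_label(lightweight_ok, dcgm_failed_tests=(), fatal_tests=FATAL_TESTS):
    """Return the value for `autopilot.ibm.com/gpuhealth`, or None to leave it as-is.

    lightweight_ok:    True if the lightweight health checks reported success.
    dcgm_failed_tests: names of DCGM level 3 tests that failed (empty if none).
    """
    # Only fatal errors should produce EVICT.
    if any(test in fatal_tests for test in dcgm_failed_tests):
        return "EVICT"
    # An issue from the lightweight checks yields WARN.
    if not lightweight_ok:
        return "WARN"
    # Non-fatal invasive failure: not specified in the text, so do not decide here.
    if dcgm_failed_tests:
        return None
    # Lightweight and invasive checks both succeeded.
    return "PASS"


print(gpuhealth_label(True))            # PASS
print(gpuhealth_label(False))           # WARN
print(gpuhealth_label(True, ["ECC"]))   # EVICT
```

Checking for fatal errors first reflects the statement that only fatal errors produce `EVICT`; the ordering of the remaining branches is a design choice of this sketch.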
68
+
69
+
If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format:
if success and node_labels["autopilot.ibm.com/gpuhealth"] in ["PASS", "TESTING"]:
235
+
# If there is some other warning coming from other tests, i.e., ping or storage, we would overwrite this information. Let's play it safe at this point.