HEALTH_CHECKS.md (+5, -1)
@@ -35,6 +35,8 @@ Here is a breakdown of the existing health checks:
These checks are configured to run periodically (e.g., hourly), and results are accessible via Prometheus, direct API queries or labels on the worker nodes.
+
+
## Deep Diagnostics and Node Labeling
Autopilot's periodic health checks label the worker nodes according to the results obtained.
The invasive DCGM diagnostics level 3 health check is executed automatically only on nodes that have free GPUs. This deeper analysis reveals GPU problems that can be found only by running the level 3 DCGM diagnostic.
+
+
+
This type of diagnostics can help decide whether a worker node should be used for running workloads. To facilitate this task, Autopilot labels nodes with the key `autopilot.ibm.com/dcgm.level.3`.
If a fatal error is found, the `gpuhealth` label is updated to evict.
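
As a rough illustration of how a consumer might act on this label, the sketch below checks a node's label map for the eviction marker. The full label key `autopilot.ibm.com/gpuhealth` and the case-insensitive comparison are assumptions; the text above names only a `gpuhealth` label that is updated to evict. The function name `should_evict` is hypothetical.

```python
from typing import Mapping

# Assumption: the full key is not spelled out in the text, which names only `gpuhealth`.
GPU_HEALTH_KEY = "autopilot.ibm.com/gpuhealth"
# Assumption: the value is compared case-insensitively, since the exact casing is not given.
EVICT_VALUE = "evict"


def should_evict(node_labels: Mapping[str, str]) -> bool:
    """Return True when the node's gpuhealth label marks it for eviction."""
    value = node_labels.get(GPU_HEALTH_KEY, "")
    return value.strip().lower() == EVICT_VALUE


# Example with hand-written label maps (not real cluster data).
healthy_node = {"kubernetes.io/hostname": "worker-0"}
faulty_node = {"kubernetes.io/hostname": "worker-1", GPU_HEALTH_KEY: "EVICT"}
print(should_evict(healthy_node), should_evict(faulty_node))
```

A node without the label is treated as not marked for eviction here; a stricter policy could instead treat a missing label as unknown.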
README.md (+12, -23)
@@ -1,16 +1,14 @@
# AI Training Autopilot
-[[README refresh ongoing]]
-
Autopilot is a Kubernetes-native daemon that continuously monitors and evaluates GPUs, network and storage health, designed to detect and report infrastructure-level issues during the lifetime of AI workloads. It is an open-source project developed by IBM Research.
In AI training jobs, which may run for weeks or months, anomalies in the GPUs and network can happen anytime and often go undetected. In this case, performance degrades suddenly and a deep diagnostic is needed to identify the root cause, delaying or deleting the current job. Similarly, hardware anomalies can greatly disrupt the throughput and latency of an AI inference server.
-The role of Autopilot is to detect and report any problems that are detected during the lifetime of the job and the existence of a cluster.
+The role of Autopilot is to detect and report any problems that are detected by its health checks during the lifetime of the job and the existence of a cluster.
It implements a set of health checks evaluating the status of the system. These health checks focus mainly on subtle/software issues (e.g., row-remapping or PCIe link degradation), but also run connectivity tests (e.g., ping, iperf) to verify that secondary NICs are reachable. It can also verify that persistent volume claim (PVC) creation is functional for a given storage class.
Autopilot is deployed as a Kubernetes DaemonSet on all worker nodes that have GPUs. Each pod exposes a Service that can be accessed through a RESTful API to request the execution of health checks. Each health check has its own entry point, and a generic “status” entry point is also provided.
@@ -20,25 +18,7 @@ The main code is written in Go, while health checks are written in a combination
If Autopilot requires full access to GPUs to run more invasive workloads, it will spawn a separate job with resource requests and limits set.
The toolkit currently provides health checks for pre-flight and post-flight phases, while in-flight checks will be enabled in the future. In more detail (list subject to change):
-
-- pre-flight checks
-
-  - validate infrastructure before the start of jobs
-
-- in-flight checks
-
-  - workload and system performance is continuously monitored
-
-  - detect anomaly, and issue notification
-
-  - controllers can take actions if errors are found
-
-- post-flight checks
-
-  - validate infrastructure once the job ends
+
## Health Checks
@@ -62,6 +42,15 @@ Results from health checks are exported as Prometheus Gauges, so that users and
A detailed description of all the health checks can be found in [HEALTH_CHECKS.md](HEALTH_CHECKS.md).
+### Diagnostics and Node Labeling
+
+Autopilot's periodic and invasive health checks label the worker nodes according to the results obtained.
+Lightweight and invasive health checks may use different labeling systems. Refer to [HEALTH_CHECKS.md](HEALTH_CHECKS.md) for more details about the label format.
+
+The information saved in the labels can be used by admins, kube-scheduler, or other workload management systems like [CodeFlare](https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/) to steer the execution of workloads for enhanced fault tolerance.
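
One way such steering could look, sketched under assumptions: the snippet builds a standard Kubernetes `nodeAffinity` stanza that keeps pods off nodes whose health label carries an unwanted value. The `nodeAffinity` field names and the `NotIn` operator are part of the Kubernetes pod spec; the label key `autopilot.ibm.com/gpuhealth`, the value `EVICT`, and the helper name `avoid_unhealthy_nodes` are assumptions for illustration, and the real label format is the one documented in HEALTH_CHECKS.md.

```python
import json
from typing import Iterable

# Assumption: full label key; the documentation excerpt refers only to a `gpuhealth` label.
GPU_HEALTH_KEY = "autopilot.ibm.com/gpuhealth"


def avoid_unhealthy_nodes(bad_values: Iterable[str]) -> dict:
    """Build a pod-spec affinity stanza excluding nodes labeled with any of bad_values."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": GPU_HEALTH_KEY,
                                # NotIn also matches nodes that do not carry the label at all.
                                "operator": "NotIn",
                                "values": list(bad_values),
                            }
                        ]
                    }
                ]
            }
        }
    }


# Assumption: "EVICT" is the label value to avoid; label values are matched case-sensitively.
affinity = avoid_unhealthy_nodes(["EVICT"])
print(json.dumps(affinity, indent=2))
```

The resulting dictionary would go under `spec.affinity` of a pod template. A hard requirement is used here; a preference (`preferredDuringSchedulingIgnoredDuringExecution`) would be the softer alternative.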
+
+
+
## Install
To learn how to install Autopilot, please refer to [SETUP.md](SETUP.md)