Commit cd4fc6f

Readme updates (#69)
Signed-off-by: Claudia <[email protected]>
1 parent 2111acb commit cd4fc6f

14 files changed, +46 -36 lines changed

HEALTH_CHECKS.md

+5 -1
@@ -35,6 +35,8 @@ Here is a breakdown of the existing health checks:

 These checks are configured to run periodically (e.g., hourly), and results are accessible via Prometheus, direct API queries or labels on the worker nodes.

+![image](figures/periodic-check-flow.svg)
+
 ## Deep Diagnostics and Node Labeling

 Autopilot's periodic health checks, will label the worker nodes according to the result obtained.
@@ -55,11 +57,13 @@ autopilot.ibm.com/gpuhealth: WARN
 ### Invasive health checks

 The invasive DCGM diagnostics level 3 health check, executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
+
+![image](figures/invasive-check-flow.svg)
+
 This type of diagnostics can help deciding if the worker node should be used for running workloads or not. To facilitate this task, Autopilot will label nodes with key `autopilot.ibm.com/dcgm.level.3`.

 If a fatal error is found, the `gpuhealth` label is updated to evict.

-
 ```yaml
 autopilot.ibm.com/gpuhealth: EVICT
 ```
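Reviewer note: the hunk above says the `gpuhealth` label is set to `EVICT` on a fatal error and that the labels help decide whether a node should run workloads. A minimal sketch of how a consumer might act on that is shown below. Only the label key and the `WARN`/`EVICT` values come from the diff. The function names, the node names and the `PASS` value are hypothetical, and the rule "anything other than `EVICT` is usable" is an assumption, not Autopilot's documented policy.

```python
from __future__ import annotations

# Label key and the EVICT value are taken from the diff above.
GPU_HEALTH_LABEL = "autopilot.ibm.com/gpuhealth"
EVICT = "EVICT"


def is_usable(node_labels: dict[str, str]) -> bool:
    """Return False only when the node carries gpuhealth=EVICT.

    Assumption: nodes without the label, or with any other value
    (for example WARN), are still considered usable.
    """
    return node_labels.get(GPU_HEALTH_LABEL) != EVICT


def usable_nodes(nodes: dict[str, dict[str, str]]) -> list[str]:
    """Return the names of usable nodes, sorted for stable output."""
    return sorted(name for name, labels in nodes.items() if is_usable(labels))


if __name__ == "__main__":
    # Hypothetical cluster snapshot: node name -> node labels.
    cluster = {
        "worker-a": {GPU_HEALTH_LABEL: "PASS"},
        "worker-b": {GPU_HEALTH_LABEL: "WARN"},
        "worker-c": {GPU_HEALTH_LABEL: "EVICT"},
        "worker-d": {},
    }
    print(usable_nodes(cluster))
```

In a real cluster the label dictionaries would come from the Kubernetes API rather than a hard-coded snapshot.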

README.md

+12 -23
@@ -1,16 +1,14 @@
 # AI Training Autopilot

-[[README refresh ongoing]]
-
 Autopilot is a Kubernetes-native daemon that continuously monitors and evaluates GPUs, network and storage health, designed to detect and report infrastructure-level issues during the lifetime of AI workloads. It is an open-source project developed by IBM Research.

 In AI training jobs, which may run for weeks or months, anomalies in the GPUs and network can happen anytime and often go undetected. In this case, performance degrades suddenly and a deep diagnostic is needed to identify the root cause, delaying or deleting the current job. Similarly, hardware anomalies can greatly disrupt the throughput and latency of an AI inference server.

-The role of Autopilot is to detect and report any problems that are detected during the lifetime of the job and the existence of a cluster.
+The role of Autopilot is to detect and report any problems that are detected by its health checks during the lifetime of the job and the existence of a cluster.

 It implements a set of health checks evaluating the status of the system. These health checks focus mainly on subtle/software issues (i.e., row-remapping or PCIe link degradation), but also run connectivity tests (i.e., ping, iperf) to verify that secondary NICs are reachable. It can also verify that persistent volume claims (PVC) creation is functional for a given storage class.

-![image](https://media.github.ibm.com/user/96687/files/0d466863-a19e-459d-a492-e2275377d4b9)
+![image](figures/autopilot-daemon-pod.png)

 Autopilot is deployed as a Kubernetes DaemonSet on all worker nodes that have GPUs. Each pod exposes a Service that can be accessed through RESTful API to request the execution of health checks. Therefore, each health check has its own entry point, but also a generic “status” entry point is provided.

@@ -20,25 +18,7 @@ The main code is written in Go, while health checks are written in a combination

 If Autopilot requires full access to GPUs to run more invasive workloads, it will spawn a separate job with resources requests and limits set.

-![image](https://media.github.ibm.com/user/96687/files/4a7c81ba-857a-43d4-bc82-0784ef81b270)
-
-The toolkit currently provides health checks for pre-flight and post-flight phases, while in-flight checks will be enabled in the future. In more details (list subject to change):
-
-- pre-flight checks
-
-- validate infrastructure before the start of jobs
-
-- in-flight checks
-
-- workload and system performance is continuously monitored
-
-- detect anomaly, and issue notification
-
-- controllers can take actions if errors are found
-
-- post-flight checks
-
-- validate infrastructure once the job ends
+![image](figures/autopilot-main-loop.svg)

 ## Health Checks

@@ -62,6 +42,15 @@ Results from health checks are exported as Prometheus Gauges, so that users and

 Detailed description of all the health checks, can be found in [HEALTH_CHECKS.md](HEALTH_CHECKS.md).

+### Diagnostics and Node Labeling
+
+Autopilot's periodic and invasive health checks, will label the worker nodes according to the result obtained.
+Lightweight and invasive health checks, may use different labeling system. Refer to [HEALTH_CHECKS.md](HEALTH_CHECKS.md) for more details about the labels format.
+
+The information saved in the labels, can be used by admins, kube-scheduler or other workload management systems like [CodeFlare](https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/) to steer the execution of workloads for enhanced fault tolerance.
+
+![image](figures/big-picture.svg)
+
 ## Install

 To learn how to install Autopilot, please refer to [SETUP.md](SETUP.md)
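
Reviewer note: the added "Diagnostics and Node Labeling" section says the labels can be used by kube-scheduler to steer workloads. One way this could look is a pod affinity stanza that keeps pods off nodes labeled `EVICT`. The sketch below is only an illustration and is not taken from the Autopilot repository. The helper name `avoid_evicted_nodes_affinity` is hypothetical, while the field names follow the standard Kubernetes node affinity schema.

```python
import json

# Label key and the EVICT value appear in HEALTH_CHECKS.md in this commit.
GPU_HEALTH_LABEL = "autopilot.ibm.com/gpuhealth"


def avoid_evicted_nodes_affinity() -> dict:
    """Build a pod `affinity` stanza that excludes nodes labeled EVICT.

    Hypothetical helper: it only assembles a dictionary following the
    Kubernetes nodeAffinity schema and does not contact a cluster.
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": GPU_HEALTH_LABEL,
                                "operator": "NotIn",
                                "values": ["EVICT"],
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # The result would be placed under `spec.affinity` of a pod template.
    print(json.dumps(avoid_evicted_nodes_affinity(), indent=2))
```

A hard requirement is used here so that pods are never placed on an evicted node. A preference-based rule would be the softer alternative if capacity matters more than strict isolation.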

figures/autopilot-daemon-pod.pdf

474 KB
Binary file not shown.

figures/autopilot-daemon-pod.png

90.8 KB

figures/autopilot-main-loop.pdf

35.7 KB
Binary file not shown.

figures/autopilot-main-loop.svg

+3

figures/big-picture.pdf

68.9 KB
Binary file not shown.
