Skip to content

Commit 2111acb

Browse files
authored
Update README.md
Signed-off-by: Claudia Misale <[email protected]>
1 parent b051b4a commit 2111acb

File tree

1 file changed

+2
-0
lines changed

1 file changed

+2
-0
lines changed

README.md

+2
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
# AI Training Autopilot
22

3+
[[README refresh ongoing]]
4+
35
Autopilot is a Kubernetes-native daemon that continuously monitors and evaluates GPUs, network and storage health, designed to detect and report infrastructure-level issues during the lifetime of AI workloads. It is an open-source project developed by IBM Research.
46

57
In AI training jobs, which may run for weeks or months, anomalies in the GPUs and network can happen anytime and often go undetected. In this case, performance degrades suddenly and a deep diagnostic is needed to identify the root cause, delaying or deleting the current job. Similarly, hardware anomalies can greatly disrupt the throughput and latency of an AI inference server.

0 commit comments

Comments
 (0)