
Commit b74c436

Node Label Updates (#66)
Signed-off-by: Claudia <[email protected]>
1 parent a8231e6 commit b74c436

16 files changed: +461 −282 lines

.github/PULL_REQUEST_TEMPLATE.md (+1 −1)

@@ -1,6 +1,6 @@
 <!--
 
--=-=-=-= Replace the itallic content with your own. -=-=-=-=
+-=-=-=-= Replace the italic content with your own. -=-=-=-=
 -=-=-=-= Don't forget to update the GitHub Issue -=-=-=-=
 -=-=-=-= your Pull-Request pertains to! -=-=-=-=
 

HEALTH_CHECKS.md (+34 −4)

@@ -6,6 +6,8 @@ Here is a breakdown of the existing health checks:
 - Description : Host-to-device connection speeds, one measurement per GPU. Codebase in tag [v12.4.1](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest)
 - Outputs: Pass/fail results based on PCIe bandwidth thresholds.
 - Implementation: Compares bandwidth results to a threshold (e.g., 8 GB/s). If the measured bandwidth falls below the threshold, it triggers a failure.
+- It is recommended to set a threshold that is 25% or lower of the expected peak PCIe bandwidth capability, which maps to maximum peak from 16 lanes to 4 lanes. For example, for a PCIe Gen4x16, reported peak bandwidth is 63GB/s. A degradation at 25% is 15.75GB/s, which corresponds to PCIe Gen4x4.
+- The measured bandwidth is expected to be at least 80% of the expected peak PCIe generation bandwidth.
 2. **GPU Memory Check (remapped)**
 - Description: Information from nvidia-smi regarding GPU memory remapped rows.
 - Outputs: Reports the state of GPU memory (normal/faulty).
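The two bullets added above amount to a simple arithmetic rule. A minimal sketch of that rule, using hypothetical variable names and the Gen4 x16 example from the text (this code is not part of the commit):

```python
# Illustrative sketch of the threshold guidance above; not Autopilot code.
# "measured_gbps" and "expected_peak_gbps" are hypothetical names.
def pcie_bandwidth_ok(measured_gbps: float, expected_peak_gbps: float = 63.0) -> bool:
    """Return True if host-to-device bandwidth passes the recommended threshold."""
    threshold_gbps = 0.25 * expected_peak_gbps   # 25% of Gen4 x16 peak (63 GB/s) = 15.75 GB/s, i.e. Gen4 x4
    if measured_gbps < 0.80 * expected_peak_gbps:
        # Healthy links are expected to reach at least ~80% of the reported peak
        print(f"note: {measured_gbps:.1f} GB/s is below 80% of the expected peak")
    return measured_gbps >= threshold_gbps       # fail only when below the 25% threshold
```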
@@ -35,14 +37,40 @@ These checks are configured to run periodically (e.g., hourly), and results are
 
 ## Deep Diagnostics and Node Labeling
 
-Autopilot runs health checks periodically on GPU nodes, and if any of the health checks returns an error, the node is labeled with `autopilot.ibm.com/gpuhealth: ERR`. Otherwise, the label is set as `PASS`.
+Autopilot's periodic health checks, will label the worker nodes according to the result obtained.
+Lightweight and invasive health checks, may use different labeling system.
 
-Also, more extensive tests, namely DCGM diagnostics level 3, are also executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
+If the health checks, lightweight or invasive, report success, the node is marked with
+
+```yaml
+autopilot.ibm.com/gpuhealth: PASS
+```
+
+When the lightweight health checks report an issue, the node is labelled with
+
+```yaml
+autopilot.ibm.com/gpuhealth: WARN
+```
+
+### Invasive health checks
+
+The invasive DCGM diagnostics level 3 health check, executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
 This type of diagnostics can help deciding if the worker node should be used for running workloads or not. To facilitate this task, Autopilot will label nodes with key `autopilot.ibm.com/dcgm.level.3`.
 
-If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format:
+If a fatal error is found, the `gpuhealth` label is updated to evict.
 
-`ERR_Year-Month-Date_Hour.Minute.UTC_Diagnostic_Test.gpuID,Diagnostic_Test.gpuID,...`
+
+```yaml
+autopilot.ibm.com/gpuhealth: EVICT
+```
+
+Only fatal errors should produce an `EVICT` label. We follow [NVIDIA recommendations](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#id3), although it is possible to customize the list of tests through the Helm chart. The default values are `[PCIe,NVLink,ECC,GPU Memory]`.
+
+If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format:
+
+```yaml
+autopilot.ibm.com/dcgm.level.3: ERR_Year-Month-Date_Hour.Minute.UTC_Diagnostic_Test.gpuID,Diagnostic_Test.gpuID,...`
+```
 
 - `ERR`: An indicator that an error has occurred
 - `Year-Month-Date_Hour.Minute.UTC`: Timestamp of completed diagnostics
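The label value documented above is a flat string, so it can be split mechanically. A minimal parsing sketch, assuming the timestamp keeps the `Date_Time` shape shown in the format (illustrative only, not code from this commit; the example label value is hypothetical):

```python
# Illustrative sketch only: split an `autopilot.ibm.com/dcgm.level.3` label value
# of the form ERR_<Year-Month-Date>_<Hour.Minute.UTC>_<Test.gpuID,Test.gpuID,...>
def parse_dcgm_level3_label(value: str):
    status, _, rest = value.partition("_")
    if status != "ERR":
        # e.g. a PASS_<timestamp> value carries no failing tests
        return status, rest, []
    date, time, tests = rest.split("_", 2)
    return status, f"{date}_{time}", tests.split(",")

# Hypothetical label value:
print(parse_dcgm_level3_label("ERR_2024-10-01_14.30.UTC_PCIe.0,NVLink.1"))
# ('ERR', '2024-10-01_14.30.UTC', ['PCIe.0', 'NVLink.1'])
```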
@@ -63,3 +91,5 @@ All metrics are accessible through Prometheus and Grafana dashboards. The gauge
 - `node`, filter by node name
 - `cpumodel` and `gpumodel`, for heterogeneous clusters
 - `deviceid` to select specific GPUs, when available
+
+For more information on how to set up alerts based on metrics, please refer to the [alert manager folder](alertmanager/README.md).
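To illustrate how a gauge with the label set listed above can be declared and filtered, here is a minimal sketch with the `prometheus_client` Python library. The metric name `autopilot_health_checks` and the exact label names are assumptions for illustration, not taken from this commit (the daemon itself exports its metrics from Go):

```python
# Illustrative sketch only: a gauge carrying the filterable labels mentioned above.
from prometheus_client import Gauge

health_gauge = Gauge(
    "autopilot_health_checks",                    # hypothetical metric name
    "Health check results reported by Autopilot",
    ["check", "node", "cpumodel", "gpumodel", "deviceid"],
)

# A failing PCIe bandwidth measurement on GPU 0 of a hypothetical node:
health_gauge.labels(
    check="pciebw", node="worker-0",
    cpumodel="AMD EPYC 7763", gpumodel="NVIDIA-A100-SXM4-80GB", deviceid="0",
).set(1)
```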

autopilot-daemon/gpu-dcgm/entrypoint.py (+23 −5)

@@ -15,11 +15,12 @@
 parser.add_argument('-r', '--run', type=str, default='1')
 parser.add_argument('-l', '--label_node', action='store_true')
 args = parser.parse_args()
+
 def main():
     output = os.popen('bash ./utils/briefings.sh')
     result = output.read()
     print(result)
-
+
     if "ABORT" not in result:
         print("[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.")
         command = ['dcgmi', 'diag', '-j', '-r', args.run]
@@ -217,11 +218,29 @@ def patch_node(success, output):
     timestamp = now.strftime("%Y-%m-%d_%H.%M.%SUTC")
     result = ""
     general_health = "PASS"
-    if success:
-        result = "PASS_"+timestamp
+    try:
+        k8s_node = v1.read_node(nodename)
+    except ApiException as e:
+        print("Exception when calling corev1api->read_node: %s\n" % e)
+        exit()
+
+    node_labels = k8s_node.metadata.labels
+    if os.getenv("DCGM_FATAL_ERRORS") == "":
+        # Only fatal errors should produce an EVICT label. Based on https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#id3
+        dcgm_fatal_errors = ['PCIe','NVLink','ECC','GPU Memory']
     else:
+        dcgm_fatal_errors = os.getenv("DCGM_FATAL_ERRORS")
+
+    if success and node_labels["autopilot.ibm.com/gpuhealth"] in ["PASS", "TESTING"]:
+        # If there is some other warning coming from other tests, i.e., ping or storage, we would overwrite this information. Let's play it safe at this point.
+        result = "PASS_"+timestamp
+    elif not success:
         result = "ERR_"+timestamp+"_"+output
-        general_health = "EVICT"
+        general_health = "WARN"
+        for error in dcgm_fatal_errors:
+            unified = unify_string_format(error)
+            if unified in result:
+                general_health = "EVICT"
 
     label = {
         "metadata": {
@@ -230,7 +249,6 @@ def patch_node(success, output):
             "autopilot.ibm.com/gpuhealth": general_health}
         }
     }
-    print("label: ", result)
     try:
         api_response = v1.patch_node(nodename, label)
     except ApiException as e:
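For context on how the labels written by `patch_node` can be consumed downstream, a minimal sketch using the Kubernetes Python client to list nodes marked for eviction (illustrative only; not part of this commit):

```python
# Illustrative sketch only: list nodes carrying the EVICT label set by patch_node.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()
nodes = v1.list_node(label_selector="autopilot.ibm.com/gpuhealth=EVICT")
for node in nodes.items:
    # Print the node name and, if present, the detailed DCGM level 3 result
    print(node.metadata.name, node.metadata.labels.get("autopilot.ibm.com/dcgm.level.3"))
```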

autopilot-daemon/gpu-power/power-throttle.sh (+8)

@@ -1,4 +1,12 @@
 #!/bin/bash
+OUT="$(bash /home/autopilot/utils/briefings.sh | grep ABORT)"
+echo ${OUT}
+if [[ ! -z $OUT ]]; then
+    echo "[[GPU POWER]] ABORT"
+    exit 0
+fi
+echo "[[GPU POWER]] Briefings completed. Continue with power cap evaluation."
+
 RES=$(ls -d /dev/nvidia* 2>1)
 numre='^[0-9]+$'
 D=-1

autopilot-daemon/pkg/cmd/main.go (+23 −19)

@@ -8,7 +8,8 @@ import (
 	"os"
 	"time"
 
-	"github.com/IBM/autopilot/pkg/handlers"
+	"github.com/IBM/autopilot/pkg/handler"
+	"github.com/IBM/autopilot/pkg/healthcheck"
 	"github.com/IBM/autopilot/pkg/utils"
 	"github.com/prometheus/client_golang/prometheus"
 	"github.com/prometheus/client_golang/prometheus/promhttp"
@@ -17,7 +18,7 @@ import (
 
 func main() {
 	port := flag.String("port", "3333", "Port for the webhook to listen to. Defaulted to 3333")
-	bwThreshold := flag.String("bw", "4", "Sets bandwidth threshold for the init container")
+	bwThreshold := flag.Int("bw", 4, "Sets bandwidth threshold for the init container")
 	logFile := flag.String("logfile", "", "File where to save all the events")
 	v := flag.String("loglevel", "2", "Log level")
 	repeat := flag.Int("w", 24, "Run all tests periodically on each node. Time set in hours. Defaults to 24h")
@@ -45,6 +46,9 @@ func main() {
 
 	utils.InitHardwareMetrics()
 
+	// Init the node status map
+	healthcheck.InitNodeStatusMap()
+
 	pMux := http.NewServeMux()
 	promHandler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
 	pMux.Handle("/metrics", promHandler)
@@ -59,7 +63,7 @@ func main() {
 	}()
 
 	readinessMux := http.NewServeMux()
-	readinessMux.Handle("/readinessprobe", handlers.ReadinessProbeHandler())
+	readinessMux.Handle("/readinessprobe", handler.ReadinessProbeHandler())
 
 	go func() {
 		klog.Info("Serving Readiness Probe on :8080")
@@ -72,19 +76,19 @@ func main() {
 
 	hcMux := http.NewServeMux()
 
-	hcMux.Handle("/dcgm", handlers.DCGMHandler())
-	hcMux.Handle("/gpumem", handlers.GpuMemHandler())
-	hcMux.Handle("/gpupower", handlers.GpuPowerHandler())
-	hcMux.Handle("/iperf", handlers.IperfHandler())
-	hcMux.Handle("/iperfservers", handlers.StartIperfServersHandler())
-	hcMux.Handle("/iperfstopservers", handlers.StopAllIperfServersHandler())
-	hcMux.Handle("/iperfclients", handlers.StartIperfClientsHandler())
-	hcMux.Handle("/invasive", handlers.InvasiveCheckHandler())
-	hcMux.Handle("/pciebw", handlers.PCIeBWHandler(utils.UserConfig.BWThreshold))
-	hcMux.Handle("/ping", handlers.PingHandler())
-	hcMux.Handle("/pvc", handlers.PVCHandler())
-	hcMux.Handle("/remapped", handlers.RemappedRowsHandler())
-	hcMux.Handle("/status", handlers.SystemStatusHandler())
+	hcMux.Handle("/dcgm", handler.DCGMHandler())
+	hcMux.Handle("/gpumem", handler.GpuMemHandler())
+	hcMux.Handle("/gpupower", handler.GpuPowerHandler())
+	hcMux.Handle("/iperf", handler.IperfHandler())
+	hcMux.Handle("/iperfservers", handler.StartIperfServersHandler())
+	hcMux.Handle("/iperfstopservers", handler.StopAllIperfServersHandler())
+	hcMux.Handle("/iperfclients", handler.StartIperfClientsHandler())
+	hcMux.Handle("/invasive", handler.InvasiveCheckHandler())
+	hcMux.Handle("/pciebw", handler.PCIeBWHandler())
+	hcMux.Handle("/ping", handler.PingHandler())
+	hcMux.Handle("/pvc", handler.PVCHandler())
+	hcMux.Handle("/remapped", handler.RemappedRowsHandler())
+	hcMux.Handle("/status", handler.SystemStatusHandler())
 
 	s := &http.Server{
 		Addr: ":" + *port,
@@ -119,7 +123,7 @@ func main() {
 	go utils.WatchNode()
 
 	// Run the health checks at startup, then start the timer
-	handlers.PeriodicCheck()
+	healthcheck.PeriodicCheck()
 
 	periodicChecksTicker := time.NewTicker(time.Duration(*repeat) * time.Hour)
 	defer periodicChecksTicker.Stop()
@@ -128,10 +132,10 @@ func main() {
 	for {
 		select {
 		case <-periodicChecksTicker.C:
-			handlers.PeriodicCheck()
+			healthcheck.PeriodicCheck()
 		case <-invasiveChecksTicker.C:
 			if *invasive > 0 {
-				handlers.InvasiveCheck()
+				healthcheck.InvasiveCheck()
 			}
 		}
 	}
