Commit 0099963

Merge branch 'main' into ladithyav-dependabot-10-03
2 parents cdaadb1 + d207b26 commit 0099963

18 files changed: +2202 -2 lines changed

.ko.yaml

Lines changed: 17 additions & 0 deletions
@@ -237,3 +237,20 @@ builds:
       org.opencontainers.image.version: "{{.Env.VERSION}}"
       org.opencontainers.image.revision: "{{.Env.GIT_COMMIT}}"
       org.opencontainers.image.created: "{{.Env.BUILD_DATE}}"
+
+  - id: slurm-drain-monitor
+    dir: health-monitors/slurm-drain-monitor
+    main: .
+    ldflags:
+      - "-s -w"
+      - "-X main.version={{.Env.VERSION}} -X main.commit={{.Env.GIT_COMMIT}} -X main.date={{.Env.BUILD_DATE}}"
+    annotations:
+      org.opencontainers.image.description: "Slurm drain monitor for detecting external Slurm drains and publishing health events"
+    labels:
+      org.opencontainers.image.source: "https://github.com/nvidia/nvsentinel"
+      org.opencontainers.image.licenses: "Apache-2.0"
+      org.opencontainers.image.title: "NVSentinel Slurm Drain Monitor"
+      org.opencontainers.image.description: "Slurm drain monitor for detecting external Slurm drains and publishing health events"
+      org.opencontainers.image.version: "{{.Env.VERSION}}"
+      org.opencontainers.image.revision: "{{.Env.GIT_COMMIT}}"
+      org.opencontainers.image.created: "{{.Env.BUILD_DATE}}"
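
The `-X main.version=...` linker flags in this build entry only take effect if package `main` declares string variables with matching names. A minimal sketch of what that might look like, assuming the variable names `version`, `commit`, and `date` implied by the flags; the default values and the `buildInfo` helper are hypothetical and are not taken from the commit:

```go
package main

import "fmt"

// These names must match the -X main.<name>=... linker flags.
// The defaults below are illustrative placeholders that apply only when the
// binary is built without those flags (for example with a plain `go build`).
var (
	version = "dev"
	commit  = "none"
	date    = "unknown"
)

// buildInfo is a hypothetical helper that formats the injected values for logging.
func buildInfo() string {
	return fmt.Sprintf("slurm-drain-monitor version=%s commit=%s date=%s", version, commit, date)
}

func main() {
	fmt.Println(buildInfo())
}
```

The linker's `-X` flag can only set package-level variables of type `string`, so constants or other types would be silently left unchanged.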

distros/kubernetes/nvsentinel/values.yaml

Lines changed: 2 additions & 1 deletion
@@ -67,7 +67,8 @@ global:
     compress: true # Compress rotated log files
 
   nodeSelector: {}
-  tolerations: []
+  tolerations:
+    - operator: Exists
   affinity: {}
   systemNodeSelector: {}
   systemNodeTolerations: []
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# ADR-029: Slurm External Drain Health Monitor
+
+## Context
+
+Slurm nodes can be drained by sources other than the slurm-operator (e.g. Slurm's native HealthCheckProgram, admins via `scontrol`, or the native health_check script with `[HC]` reason). The slurm-operator treats any drain reason **not** prefixed with `slurm-operator:` as externally owned and does not modify or clear it. NodeSet pod conditions already reflect the Slurm node state and reason (`SlurmNodeStateDrain` + `Message`).
+
+NVSentinel needs to treat these external Slurm drains as health events so the **full remediation cycle** runs (quarantine → drain → remediate). The DrainRequest is in the **middle** of that pipeline, so we cannot start from a DrainRequest—we must start at the **health event** and publish to the NVSentinel API. The logic is the same as other health monitors; only the **source** (pod conditions) and **publishing endpoint** (NVSentinel API) differ. We also need a **generic** way to parse multi-check drain reasons (e.g. split by a delimiter and regex match) so other consumers can reuse the same contract.
+
+When NVSentinel cordons the node in response to an external drain, it must **not** set the `nodeset.slinky.slurm.net/node-cordon-reason` annotation, so the slurm-operator never overwrites or takes ownership of the external reason.
+
+## Decision
+
+Add a **health monitor** that watches NodeSet pod conditions for external Slurm drains, parses the reason string with a **generic split + regex** contract, and **publishes health events to the NVSentinel API** (same as other monitors). Events enter the normal pipeline (fault-quarantine, node-drainer, fault-remediation). The monitor only publishes events; it does not cordon. When NVSentinel cordons the node downstream (via fault-quarantine / node-drainer) for this event source, it does **not** set the `node-cordon-reason` annotation; cordon node only. This gives the full remediation cycle from a single entry point (health event) and keeps the design reusable via generic parsing.
+
+## Implementation
+
+### 1. Slurm-drain health monitor
+
+**Placement:** `NVSentinel/health-monitors/slurm-drain-monitor/`. Structure follows kubernetes-object-monitor: `main.go`, `pkg/{config,controller,parser,publisher,initializer}`. Go, controller-runtime, same `pb` and gRPC client.
+
+**Detection:**
+- **Watch:** NodeSet pods (namespace / label selector). Read `status.conditions`.
+- **Detect external drain:** Condition type `SlurmNodeStateDrain` (matches operator constant `PodConditionDrain`), status `True`, and `Message` non-empty and **not** starting with `slurm-operator:` (matches operator constant `nodeReasonPrefix`).
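
The detection rule in the ADR reduces to a small predicate over a pod's conditions. The following is a rough sketch only, not the monitor's actual code: the `Condition` type stands in for `corev1.PodCondition` so the example has no Kubernetes dependencies, and the function name `externalDrainReason` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a stand-in for the fields of corev1.PodCondition that the
// check reads; it is an assumption made to keep the sketch dependency-free.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Message string
}

const (
	drainConditionType = "SlurmNodeStateDrain" // operator constant PodConditionDrain
	operatorPrefix     = "slurm-operator:"     // operator constant nodeReasonPrefix
)

// externalDrainReason returns the drain message and true only when the drain
// condition is True, its message is non-empty, and the message is not owned
// by the slurm-operator.
func externalDrainReason(conds []Condition) (string, bool) {
	for _, c := range conds {
		if c.Type != drainConditionType || c.Status != "True" {
			continue
		}
		if c.Message == "" || strings.HasPrefix(c.Message, operatorPrefix) {
			return "", false
		}
		return c.Message, true
	}
	return "", false
}

func main() {
	conds := []Condition{
		{Type: drainConditionType, Status: "True", Message: "[HC] gpu check failed"},
	}
	reason, ok := externalDrainReason(conds)
	fmt.Printf("external=%t reason=%q\n", ok, reason)
}
```

An operator-owned message such as `slurm-operator: cordoned` yields `false`, which matches the ownership rule in the Context section.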
+
+**Parsing (pkg/parser):**
+- **Split** `Message` by configurable delimiter (if omitted, the full message is treated as a single segment) → list of segments.
+- **Match** each segment against configurable regex rules. Each rule: `regex`, `checkName`, `componentClass`, optional `message`, `recommendedAction`, `isFatal`. If multiple rules match one segment, produce one structured reason per match.
29+
**Health event mapping:**
30+
- Build `pb.HealthEvent` per matched reason: `Agent` = `"slurm-drain-monitor"`, `CheckName`, `Message`, `IsFatal`, `RecommendedAction`, `EntitiesImpacted` = pod (v1/Pod GVK + namespace/name) and node name (from `pod.Spec.NodeName`). Same protos as other monitors.
31+
32+
**Publishing:**
33+
- Call `PlatformConnectorClient.HealthEventOccurredV1` (same API and retry logic as kubernetes-object-monitor). Event flows through fault-quarantine → node-drainer → fault-remediation.
34+
35+
**State transitions:**
36+
- **Unhealthy:** External drain newly detected or reason text changed → publish unhealthy event, store state.
37+
- **Healthy:** External drain clears (condition status `False`, condition removed, `Message` changes to `slurm-operator:`-prefixed, or pod deleted) → publish healthy event, clear state.
38+
- **Deduplication:** Track last-published `message` per pod. Only publish on transition.
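
The transition and deduplication rules amount to a small state machine keyed by pod. The sketch below is one possible reading, assuming an in-memory map and a single reconciling goroutine; the `Tracker` type, the `Observe` method, and the `Transition` values are hypothetical names, and an empty message is used here to mean "no external drain".

```go
package main

import "fmt"

// Transition tells the caller which event, if any, should be published.
type Transition int

const (
	NoChange Transition = iota
	PublishUnhealthy
	PublishHealthy
)

func (t Transition) String() string {
	return [...]string{"no-change", "publish-unhealthy", "publish-healthy"}[t]
}

// Tracker remembers the last-published external drain message per pod.
// It is not safe for concurrent use; a real controller would need locking
// or a single worker per key.
type Tracker struct {
	last map[string]string
}

func NewTracker() *Tracker {
	return &Tracker{last: make(map[string]string)}
}

// Observe records the current external drain message for podKey. An empty
// message means the pod has no external drain (condition cleared, removed,
// operator-owned, or pod deleted).
func (t *Tracker) Observe(podKey, message string) Transition {
	prev, tracked := t.last[podKey]
	switch {
	case message == "" && tracked:
		delete(t.last, podKey) // drain cleared: forget the pod
		return PublishHealthy
	case message == "":
		return NoChange // never drained, nothing to report
	case !tracked || prev != message:
		t.last[podKey] = message // new drain or changed reason text
		return PublishUnhealthy
	default:
		return NoChange // same message as last time: deduplicated
	}
}

func main() {
	tr := NewTracker()
	for _, msg := range []string{"[HC] a", "[HC] a", "[HC] b", "", ""} {
		fmt.Printf("%q -> %s\n", msg, tr.Observe("slurm/worker-0", msg))
	}
}
```

This sketch updates state before any publish call happens; an implementation that must survive publish failures would more likely store state only after the publish succeeds.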
+
+**Startup:** Re-scan all matching pods on startup and reconcile against current conditions. Publishing is idempotent so re-publishing existing drains on restart is acceptable.
+
+**Config (TOML):** `namespace`/`labelSelector` for NodeSet pods, `reasonDelimiter`, list of `patterns` (regex + optional fields). Flags: `--platform-connector-socket`, `--processing-strategy`.
+
+### 2. NVSentinel pipeline (downstream)
+
+The monitor only publishes health events; it does not cordon. When fault-quarantine or node-drainer cordons a node for events from the `slurm-drain-monitor` source, the cordon path must **not** set `node-cordon-reason`. This is a configuration / policy constraint in fault-quarantine or node-drainer, not in the monitor.
+
+### 3. Shared parser (future)
+
+The split+regex parser lives in `pkg/parser` within the monitor. If a second consumer appears, extract to a shared NVSentinel package. Until then, keep it colocated.
+
+## Rationale
+
+- **Full cycle:** Starting at the health event gives the full remediation cycle; starting from DrainRequest would be mid-pipeline.
+- **Same pattern as other monitors:** Same publishing endpoint and pipeline; only the source (pod conditions) differs. The monitor does not cordon; the "don't set node-cordon-reason" constraint is a downstream pipeline policy.
+- **Generic parsing:** Split by delimiter + regex keeps the design reusable for other consumers.
+- **No overwrite:** Not setting `node-cordon-reason` when cordoning preserves the slurm-operator's ownership model and avoids losing the audit trail in Slurm.
+
+## Consequences
+
+- **Positive:** Full remediation cycle for external Slurm drains; same NVSentinel API and pipeline; generic parsing usable elsewhere; clear separation between monitor (publish) and pipeline (cordon policy).
+- **Negative:** New monitor to deploy and configure.
+- **Mitigation:** Config-driven (TOML); follow existing health-monitor patterns (kubernetes-object-monitor).
+
+## Alternatives Considered
+
+- **DrainRequest as entry point:** Rejected because it is mid-pipeline; we need the health event first for the full cycle.
+- **Override external reason via node-cordon-reason:** Rejected; would overwrite audit trail and change slurm-operator semantics.
+- **CEL-based policy (like kubernetes-object-monitor):** CEL is well-suited for evaluating arbitrary object properties but overkill for string parsing; split+regex is simpler and more explicit for reason parsing.
+
+## References
+
+- kubernetes-object-monitor: `health-monitors/kubernetes-object-monitor` (publisher, controller, config pattern).
+- Slurm-operator constants: `SlurmNodeStateDrain` (`PodConditionDrain`), `slurm-operator:` (`nodeReasonPrefix`), `nodeset.slinky.slurm.net/node-cordon-reason` (`AnnotationNodeCordonReason`).

health-monitors/Makefile

Lines changed: 14 additions & 1 deletion
@@ -11,7 +11,8 @@ GOCOVER_COBERTURA := gocover-cobertura
 GO_HEALTH_MONITORS := \
 	syslog-health-monitor \
 	csp-health-monitor \
-	kubernetes-object-monitor
+	kubernetes-object-monitor \
+	slurm-drain-monitor
 
 PYTHON_HEALTH_MONITORS := \
 	gpu-health-monitor
@@ -59,6 +60,10 @@ lint-test-gpu-health-monitor:
 lint-test-kubernetes-object-monitor:
 	$(MAKE) -C kubernetes-object-monitor lint-test
 
+.PHONY: lint-test-slurm-drain-monitor
+lint-test-slurm-drain-monitor:
+	$(MAKE) -C slurm-drain-monitor lint-test
+
 # Build targets for health monitors (delegate to module Makefiles)
 .PHONY: build-all
 build-all:
@@ -87,6 +92,10 @@ build-gpu-health-monitor:
 build-kubernetes-object-monitor:
 	$(MAKE) -C kubernetes-object-monitor build
 
+.PHONY: build-slurm-drain-monitor
+build-slurm-drain-monitor:
+	$(MAKE) -C slurm-drain-monitor build
+
 # Clean targets (delegate to module Makefiles)
 .PHONY: clean-all
 clean-all:
@@ -115,6 +124,10 @@ clean-gpu-health-monitor:
 clean-kubernetes-object-monitor:
 	$(MAKE) -C kubernetes-object-monitor clean
 
+.PHONY: clean-slurm-drain-monitor
+clean-slurm-drain-monitor:
+	$(MAKE) -C slurm-drain-monitor clean
+
 # Help target
 .PHONY: help
 help:
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+# slurm-drain-monitor Makefile
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# =============================================================================
+# MODULE-SPECIFIC CONFIGURATION
+# =============================================================================
+
+IS_GO_MODULE := 1
+IS_KO_MODULE := 1
+CLEAN_EXTRA_FILES := slurm-drain-monitor
+
+# =============================================================================
+# INCLUDE SHARED DEFINITIONS
+# =============================================================================
+
+include ../../make/common.mk
+include ../../make/go.mk
+
+# Test setup commands for kubebuilder envtest
+TEST_SETUP_COMMANDS := \
+	go install sigs.k8s.io/controller-runtime/tools/setup-envtest@$(SETUP_ENVTEST_VERSION) && \
+	eval $$(setup-envtest use --use-env -p env) &&
+
+# =============================================================================
+# DEFAULT TARGET
+# =============================================================================
+
+.PHONY: all
+all: lint-test
+
+# =============================================================================
+# MODULE HELP
+# =============================================================================
+
+.PHONY: help
+help:
+	@echo "slurm-drain-monitor - Watches NodeSet pod conditions for external Slurm drains, publishes health events to NVSentinel API"
+	@echo ""
+	@echo "Main targets: all, lint-test, build, test, lint, clean"
+	@echo "Ko targets: ko-build, ko-publish"
+	@echo ""
+	@echo "Build notes:"
+	@echo " - Container images are built using ko"
+	@echo " - Use 'make ko-build' for local builds (KO_DOCKER_REPO=ko.local by default)"
+	@echo " - Use 'make ko-publish' to build and push (set KO_DOCKER_REPO and VERSION)"
+	@echo " - Platforms configured in .ko.yaml (linux/amd64, linux/arm64)"
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Example configuration for slurm-drain-monitor.
+# Copy to /etc/nvsentinel/config/slurm-drain-monitor.toml and adjust.
+#
+# The monitor watches NodeSet pods for external Slurm drain conditions
+# (SlurmNodeStateDrain=True, Message not starting with "slurm-operator:"),
+# parses the reason string with split+regex, and publishes health events
+# to the NVSentinel API.
+
+namespace = "slurm"
+labelSelector = "app.kubernetes.io/name=slurmd,app.kubernetes.io/component=worker"
+reasonDelimiter = "; "
+
+[[patterns]]
+name = "slurm-healthcheck"
+regex = '^\[HC\]'
+checkName = "SlurmHealthCheck"
+componentClass = "NODE"
+isFatal = false
+recommendedAction = "CONTACT_SUPPORT"
+
+[[patterns]]
+name = "slurm-not-responding"
+regex = 'Not responding'
+checkName = "SlurmNotResponding"
+componentClass = "NODE"
+isFatal = true
+recommendedAction = "REPLACE_VM"
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+module github.com/nvidia/nvsentinel/health-monitors/slurm-drain-monitor
+
+go 1.25.4
+
+require (
+	github.com/go-logr/logr v1.4.3
+	github.com/nvidia/nvsentinel/commons v0.0.0
+	github.com/nvidia/nvsentinel/data-models v0.0.0
+	github.com/prometheus/client_golang v1.23.2
+	github.com/stretchr/testify v1.11.1
+	google.golang.org/grpc v1.79.1
+	google.golang.org/protobuf v1.36.11
+	k8s.io/api v0.35.2
+	k8s.io/apimachinery v0.35.2
+	sigs.k8s.io/controller-runtime v0.22.4
+)
+
+require (
+	github.com/BurntSushi/toml v1.6.0 // indirect
+	github.com/beorn7/perks v1.0.1 // indirect
+	github.com/cespare/xxhash/v2 v2.3.0 // indirect
+	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
+	github.com/emicklei/go-restful/v3 v3.13.0 // indirect
+	github.com/evanphx/json-patch/v5 v5.9.11 // indirect
+	github.com/fsnotify/fsnotify v1.9.0 // indirect
+	github.com/fxamacker/cbor/v2 v2.9.0 // indirect
+	github.com/go-openapi/jsonpointer v0.22.3 // indirect
+	github.com/go-openapi/jsonreference v0.21.3 // indirect
+	github.com/go-openapi/swag v0.25.4 // indirect
+	github.com/go-openapi/swag/cmdutils v0.25.4 // indirect
+	github.com/go-openapi/swag/conv v0.25.4 // indirect
+	github.com/go-openapi/swag/fileutils v0.25.4 // indirect
+	github.com/go-openapi/swag/jsonname v0.25.4 // indirect
+	github.com/go-openapi/swag/jsonutils v0.25.4 // indirect
+	github.com/go-openapi/swag/loading v0.25.4 // indirect
+	github.com/go-openapi/swag/mangling v0.25.4 // indirect
+	github.com/go-openapi/swag/netutils v0.25.4 // indirect
+	github.com/go-openapi/swag/stringutils v0.25.4 // indirect
+	github.com/go-openapi/swag/typeutils v0.25.4 // indirect
+	github.com/go-openapi/swag/yamlutils v0.25.4 // indirect
+	github.com/gogo/protobuf v1.3.2 // indirect
+	github.com/google/btree v1.1.3 // indirect
+	github.com/google/gnostic-models v0.7.1 // indirect
+	github.com/google/go-cmp v0.7.0 // indirect
+	github.com/google/uuid v1.6.0 // indirect
+	github.com/json-iterator/go v1.1.12 // indirect
+	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
+	github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee // indirect
+	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
+	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
+	github.com/prometheus/client_model v0.6.2 // indirect
+	github.com/prometheus/common v0.67.4 // indirect
+	github.com/prometheus/procfs v0.19.2 // indirect
+	github.com/spf13/pflag v1.0.10 // indirect
+	github.com/x448/float16 v0.8.4 // indirect
+	github.com/yandex/protoc-gen-crd v1.1.0 // indirect
+	go.yaml.in/yaml/v2 v2.4.3 // indirect
+	go.yaml.in/yaml/v3 v3.0.4 // indirect
+	golang.org/x/net v0.49.0 // indirect
+	golang.org/x/oauth2 v0.34.0 // indirect
+	golang.org/x/sync v0.19.0 // indirect
+	golang.org/x/sys v0.40.0 // indirect
+	golang.org/x/term v0.39.0 // indirect
+	golang.org/x/text v0.33.0 // indirect
+	golang.org/x/time v0.14.0 // indirect
+	gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
+	google.golang.org/genproto/googleapis/rpc v0.0.0-20260209200024-4cfbd4190f57 // indirect
+	gopkg.in/evanphx/json-patch.v4 v4.13.0 // indirect
+	gopkg.in/inf.v0 v0.9.1 // indirect
+	gopkg.in/yaml.v3 v3.0.1 // indirect
+	k8s.io/apiextensions-apiserver v0.34.1 // indirect
+	k8s.io/client-go v0.35.2 // indirect
+	k8s.io/klog/v2 v2.130.1 // indirect
+	k8s.io/kube-openapi v0.0.0-20251125145642-4e65d59e963e // indirect
+	k8s.io/utils v0.0.0-20251002143259-bc988d571ff4 // indirect
+	sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
+	sigs.k8s.io/randfill v1.0.0 // indirect
+	sigs.k8s.io/structured-merge-diff/v6 v6.3.1 // indirect
+	sigs.k8s.io/yaml v1.6.0 // indirect
+)
+
+replace github.com/nvidia/nvsentinel/commons => ../../commons
+
+replace github.com/nvidia/nvsentinel/data-models => ../../data-models
