docs/configuration/labeler.md: 39 additions, 0 deletions
@@ -13,6 +13,10 @@ The labeler automatically manages these node labels:
|`nvsentinel.dgxc.nvidia.com/dcgm.version`|`3.x`, `4.x`| DCGM major version detected from DCGM pods |
|`nvsentinel.dgxc.nvidia.com/driver.installed`|`true`, `false`| NVIDIA driver pod status on node |
|`nvsentinel.dgxc.nvidia.com/kata.enabled`|`true`, `false`| Kata Containers runtime presence |
|`nvsentinel.dgxc.nvidia.com/gpu.count.current`| non-negative integer | Current GPU count from the configured class expression |
|`nvsentinel.dgxc.nvidia.com/gpu.count.expected`| non-negative integer | Expected GPU count from override or learned hardware-class baseline |
|`nvsentinel.dgxc.nvidia.com/nic.count.current`| non-negative integer | Current NIC count from the configured class expression |
|`nvsentinel.dgxc.nvidia.com/nic.count.expected`| non-negative integer | Expected NIC count from override or learned hardware-class baseline |

## Configuration Reference

@@ -79,6 +83,41 @@ The following label values (case-insensitive) are considered truthy for Kata det
Any other value or missing label results in `kata.enabled=false`.

## Expected Device Counts

Expected device-count labeling is disabled by default. When enabled, the labeler evaluates each enabled class and writes that class's current and expected count labels only when its configured CEL expression returns a valid non-negative integer.

The Helm chart renders this values block into a TOML ConfigMap entry and mounts it into the labeler pod. Because expressions are compiled at startup, Helm also annotates the pod template with a checksum so that any change to the ConfigMap triggers a rollout of the Deployment.
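The checksum mechanism described above can be illustrated with a minimal sketch. It assumes a SHA-256 digest over the rendered ConfigMap content, which is a common Helm pattern but is not confirmed by this document; the annotation key `checksum/config` and the TOML snippets are hypothetical.

```python
import hashlib


def config_checksum(rendered_config: str) -> str:
    """Return the SHA-256 hex digest of the rendered ConfigMap content."""
    return hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()


def pod_template_annotations(rendered_config: str) -> dict[str, str]:
    # "checksum/config" is a hypothetical annotation key, not taken from the chart.
    return {"checksum/config": config_checksum(rendered_config)}


# Hypothetical TOML snippets standing in for the rendered ConfigMap entry.
before = "enabled = false\n"
after = "enabled = true\n"

unchanged = pod_template_annotations(before) == pod_template_annotations(before)
changed = pod_template_annotations(before) != pod_template_annotations(after)
```

Identical content yields an identical annotation, so the pod template is untouched. Any edit to the content changes the annotation, which changes the pod template and causes the Deployment to roll, so the labeler restarts and recompiles its expressions.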
- `node`: the Kubernetes Node object being reconciled.
- `resourceSlices`: ResourceSlice objects associated with the node.
- `sum(list<int>)`: helper that returns the sum of a list of integers.

For classes without a matching override, the expected value is learned as the maximum of the current and existing expected counts across nodes that share the same configured grouping-label values. Learned expected counts can rise automatically, but they do not fall automatically when a node reports fewer devices.
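The learning rule above can be sketched as follows. This is an illustration under stated assumptions, not the labeler's implementation: `NodeCounts`, `learned_expected`, and the sample grouping values are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeCounts:
    name: str
    group: tuple[str, ...]          # values of the configured grouping labels
    current: Optional[int]          # observed count, None if not yet evaluated
    expected: Optional[int] = None  # existing expected label, if any


def learned_expected(nodes: list[NodeCounts], group: tuple[str, ...]) -> Optional[int]:
    """Maximum of current and existing expected counts across nodes in `group`."""
    candidates = [
        value
        for node in nodes
        if node.group == group
        for value in (node.current, node.expected)
        if value is not None
    ]
    return max(candidates) if candidates else None


nodes = [
    NodeCounts("node-a", ("h100",), current=8, expected=8),
    NodeCounts("node-b", ("h100",), current=7, expected=8),  # reports one device fewer
    NodeCounts("node-c", ("a100",), current=4),
]
```

Because existing expected values take part in the maximum, the baseline for the `("h100",)` group stays at 8 even though `node-b` now reports 7. Adding a node with a higher current count to that group would raise the baseline.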

Indicates whether the node is running Kata Containers runtime (detected from node labels).

### Expected Device Counts

**Labels**:

- `nvsentinel.dgxc.nvidia.com/gpu.count.current`
- `nvsentinel.dgxc.nvidia.com/gpu.count.expected`
- `nvsentinel.dgxc.nvidia.com/nic.count.current`
- `nvsentinel.dgxc.nvidia.com/nic.count.expected`

**Values**: non-negative integer strings

When enabled, the labeler evaluates configured CEL expressions against the node and associated DRA ResourceSlices. Current labels reflect the observed count. Expected labels come from an override or the maximum learned count among nodes in the same grouping-label partition.
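How an evaluated result might turn into label values can be sketched as below. The label keys come from the list above; the function names, the `override` and `learned` parameters, and the choice to omit the expected label when no baseline exists are assumptions made for illustration.

```python
from typing import Optional

LABEL_PREFIX = "nvsentinel.dgxc.nvidia.com"


def valid_count(result: object) -> Optional[int]:
    """Return `result` if it is a non-negative integer, otherwise None."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(result, bool) or not isinstance(result, int):
        return None
    return result if result >= 0 else None


def count_labels(
    device: str,              # "gpu" or "nic"
    result: object,           # value returned by the class expression
    override: Optional[int],  # explicit expected count, if configured
    learned: Optional[int],   # learned baseline for the node's partition
) -> dict[str, str]:
    current = valid_count(result)
    if current is None:
        return {}  # invalid result: write no count labels
    labels = {f"{LABEL_PREFIX}/{device}.count.current": str(current)}
    # An override takes precedence over the learned baseline.
    expected = override if override is not None else learned
    if expected is not None:
        labels[f"{LABEL_PREFIX}/{device}.count.expected"] = str(expected)
    return labels
```

For example, `count_labels("gpu", 7, None, 8)` yields a current label of `"7"` and an expected label of `"8"`, while a negative or non-integer result yields no labels at all. Label values are strings because Kubernetes label values are always strings.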