Commit aa0d5c8
feat: add buffered health channel and device state tracking
Add buffered health channel to prevent the health check goroutine from
blocking when ListAndWatch is slow to consume events. This addresses
stability issues with multiple GPUs and bursty XID scenarios.
Changes:
- Add healthChannelBufferSize constant (64) for burst handling
- Create buffered health channel in initialize()
- Add enhanced logging for unhealthy device reports with reason
- Add MarkUnhealthy() to Device for tracking failure reason/timestamp
- Add IsUnhealthy() and UnhealthyDuration() for diagnostics
- Add UnhealthyReason and LastUnhealthyTime fields to Device struct
The buffer size of 64 provides headroom for 8 GPUs with an average of 8
queued events per GPU, while using a power-of-2 size for cache-friendly
alignment.
Devices marked unhealthy remain in that state until external intervention
(node drain, GPU reset, reboot). The device plugin does not attempt
auto-recovery, as that decision belongs to external components like DCGM
or Node Problem Detector.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
1 parent b26865d commit aa0d5c8
2 files changed
Lines changed: 42 additions & 4 deletions