For config failures and similar errors, we currently busy-spin on daemon restarts managed by the process manager.
Let's introduce an adaptive wait time here:
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/7b5e2cd428f3c654812ec9d21347437bb6f099e7/cmd/compute-domain-daemon/process.go#L195
Example log, showing an unreasonably high restart rate:
W1014 09:04:42.467224 1 process.go:189] Watchdog: child terminated unexpectedly
[...]
W1014 09:04:42.467236 1 process.go:195] Watchdog: start process again
[...]
W1014 09:04:43.467276 1 process.go:189] Watchdog: child terminated unexpectedly
[...]
W1014 09:04:44.467334 1 process.go:195] Watchdog: start process again