CD daemon: process manager: backoff on daemon restarts #673

@jgehrcke

Description

For config failures and similar errors, the process manager currently busy-spins on daemon restarts.

Let's introduce an adaptive wait time here:

https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/7b5e2cd428f3c654812ec9d21347437bb6f099e7/cmd/compute-domain-daemon/process.go#L195

Example log, showing the unreasonably high restart rate:

W1014 09:04:42.467224       1 process.go:189] Watchdog: child terminated unexpectedly
[...]
W1014 09:04:42.467236       1 process.go:195] Watchdog: start process again
[...]
W1014 09:04:43.467276       1 process.go:189] Watchdog: child terminated unexpectedly
[...]
W1014 09:04:44.467334       1 process.go:195] Watchdog: start process again

Labels

lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
robustness: issue/pr: edge cases & fault tolerance
