Commit 5ad8969

feat(oom_watchdog): make load threshold scale automatically with CPU count
Replace the fixed oom_watchdog_max_load_1 with a per-core threshold, multiplied by ansible_processor_vcpus at render time, so the limit adapts to different host sizes
1 parent 2659e6b commit 5ad8969

3 files changed

Lines changed: 4 additions & 4 deletions

roles/oom_watchdog/README.md

Lines changed: 2 additions & 2 deletions
@@ -29,7 +29,7 @@ This role was developed for bare-metal GPU hosts running heavy workloads (`ere-s
 | `oom_watchdog_timeout` | `60` | Seconds the kernel watchdog waits without a ping before forcing a reboot. |
 | `oom_watchdog_interval` | `30` | Seconds between liveness checks performed by the watchdog daemon. Must be less than `oom_watchdog_timeout`. |
 | `oom_watchdog_min_memory` | `1` | Reboot when fewer than this many free pages are available — catches runaway-memory lockups before the OOM killer can act. |
-| `oom_watchdog_max_load_1` | `24` | Reboot when the 1-minute load average exceeds this value. Tune up on high-core-count GPU hosts. |
+| `oom_watchdog_max_load_1_per_core` | `4` | Per-core 1-minute load threshold. The rendered `max-load-1` is this value multiplied by `ansible_processor_vcpus`, so the threshold scales automatically with host size (e.g. `4` × 32 vCPUs ⇒ `max-load-1 = 128`). |
 | `oom_watchdog_realtime` | `true` | Run the daemon with realtime scheduling so it keeps pinging the watchdog under heavy load. |
 | `oom_watchdog_priority` | `1` | Realtime priority used when `oom_watchdog_realtime` is enabled. |
 | `oom_watchdog_nmi_enabled` | `true` | Set `kernel.nmi_watchdog = 1` so the kernel itself can detect hard lockups. Disable on hosts where the NMI watchdog is unwanted (e.g. some virtualized environments). |
@@ -43,7 +43,7 @@ This role was developed for bare-metal GPU hosts running heavy workloads (`ere-s
 vars:
 oom_watchdog_timeout: 90
 oom_watchdog_interval: 30
-oom_watchdog_max_load_1: 48
+oom_watchdog_max_load_1_per_core: 6
 ```
 
 ## Verification

roles/oom_watchdog/defaults/main.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ oom_watchdog_device: /dev/watchdog
 oom_watchdog_timeout: 60
 oom_watchdog_interval: 30
 oom_watchdog_min_memory: 1
-oom_watchdog_max_load_1: 24
+oom_watchdog_max_load_1_per_core: 4
 oom_watchdog_realtime: true
 oom_watchdog_priority: 1
 oom_watchdog_nmi_enabled: true

roles/oom_watchdog/templates/watchdog.conf.j2

Lines changed: 1 addition & 1 deletion
@@ -3,6 +3,6 @@ watchdog-device = {{ oom_watchdog_device }}
 watchdog-timeout = {{ oom_watchdog_timeout }}
 interval = {{ oom_watchdog_interval }}
 min-memory = {{ oom_watchdog_min_memory }}
-max-load-1 = {{ oom_watchdog_max_load_1 }}
+max-load-1 = {{ (oom_watchdog_max_load_1_per_core * ansible_processor_vcpus) | int }}
 realtime = {{ 'yes' if oom_watchdog_realtime else 'no' }}
 priority = {{ oom_watchdog_priority }}
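
The arithmetic behind the new `max-load-1` line can be checked with a small sketch. The function below is hypothetical and not part of the role. It mirrors the template expression `(oom_watchdog_max_load_1_per_core * ansible_processor_vcpus) | int` in plain Python, assuming both inputs are already numeric and that Jinja2's `int` filter truncates a float the same way Python's `int()` does.

```python
def rendered_max_load_1(per_core, vcpus):
    """Hypothetical mirror of the template expression:
    (oom_watchdog_max_load_1_per_core * ansible_processor_vcpus) | int
    Assumes per_core and vcpus are numbers, not strings."""
    # Multiply first, then truncate toward zero, as the template does.
    return int(per_core * vcpus)


# Default from defaults/main.yaml on the 32-vCPU host used in the README example.
print(rendered_max_load_1(4, 32))    # 128
# README override example (per-core value 6) on an assumed 8-vCPU host.
print(rendered_max_load_1(6, 8))     # 48
# A fractional per-core value is truncated, not rounded.
print(rendered_max_load_1(0.75, 3))  # 2
```

Two caveats follow from the expression as written. `ansible_processor_vcpus` is a gathered fact, so the template presumably needs fact gathering enabled on the play. Also, if the per-core variable arrived as a string (for example through `key=value` extra vars), `*` would repeat the string rather than multiply. Coercing the value with a numeric filter before multiplying would be one possible guard, though the commit does not do this.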
