fix(nvml): fix hw slowdown events db writes, support simulate hw slowdown flag (#526)
```go
const (
	EnvMockAllSuccess              = "GPUD_NVML_MOCK_ALL_SUCCESS"
	EnvInjectRemapedRowsPending    = "GPUD_NVML_INJECT_REMAPPED_ROWS_PENDING"
	EnvInjectClockEventsHwSlowdown = "GPUD_NVML_INJECT_CLOCK_EVENTS_HW_SLOWDOWN"
)
```
Tested
```
  "component": "accelerator-nvidia-hw-slowdown",
  "startTime": "2024-11-15T04:48:56Z",
  "endTime": "2025-03-13T11:34:36.809052648Z",
  "events": [
    {
      "time": "2025-03-13T11:34:00Z",
      "name": "hw_slowdown",
      "type": "Warning",
      "message": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged (External Power Brake Assertion being triggered) ('HW Power Brake Slowdown' in nvidia-smi --query) (nvml), GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Slowdown is engaged due to high temperature, power brake assertion, or high power draw ('HW Slowdown: Active' in nvidia-smi --query) (nvml), GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged (temperature being too high) ('HW Thermal Slowdown' in nvidia-smi --query) (nvml)",
      "extra_info": {
        "data_source": "nvml",
        "gpu_uuid": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1"
      }
    },
    {
      "time": "2025-03-13T11:33:00Z",
      "name": "hw_slowdown",
      "type": "Warning",
      "message": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged (External Power Brake Assertion being triggered) ('HW Power Brake Slowdown' in nvidia-smi --query) (nvml), GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Slowdown is engaged due to high temperature, power brake assertion, or high power draw ('HW Slowdown: Active' in nvidia-smi --query) (nvml), GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1: HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged (temperature being too high) ('HW Thermal Slowdown' in nvidia-smi --query) (nvml)",
      "extra_info": {
        "data_source": "nvml",
        "gpu_uuid": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1"
      }
    }
  ]
},
```
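To check output like the above programmatically, one option is to decode it and count `hw_slowdown` events per GPU. This is a sketch only: the struct names and `countByGPU` are hypothetical, the field tags mirror the JSON keys shown above, and the sample wraps the fragment in braces with an abridged `message`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// event mirrors one entry of the "events" array in the output above.
type event struct {
	Time      time.Time         `json:"time"`
	Name      string            `json:"name"`
	Type      string            `json:"type"`
	Message   string            `json:"message"`
	ExtraInfo map[string]string `json:"extra_info"`
}

// componentEvents mirrors the enclosing per-component object.
type componentEvents struct {
	Component string    `json:"component"`
	StartTime time.Time `json:"startTime"`
	EndTime   time.Time `json:"endTime"`
	Events    []event   `json:"events"`
}

// countByGPU (hypothetical helper) decodes raw and returns, per gpu_uuid,
// how many events carry the given name.
func countByGPU(raw []byte, name string) (map[string]int, error) {
	var ce componentEvents
	if err := json.Unmarshal(raw, &ce); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, ev := range ce.Events {
		if ev.Name == name {
			counts[ev.ExtraInfo["gpu_uuid"]]++
		}
	}
	return counts, nil
}

// sample is the tested output with the long message text abridged.
const sample = `{
  "component": "accelerator-nvidia-hw-slowdown",
  "startTime": "2024-11-15T04:48:56Z",
  "endTime": "2025-03-13T11:34:36.809052648Z",
  "events": [
    {"time": "2025-03-13T11:34:00Z", "name": "hw_slowdown", "type": "Warning",
     "message": "(abridged)",
     "extra_info": {"data_source": "nvml", "gpu_uuid": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1"}},
    {"time": "2025-03-13T11:33:00Z", "name": "hw_slowdown", "type": "Warning",
     "message": "(abridged)",
     "extra_info": {"data_source": "nvml", "gpu_uuid": "GPU-49004c5e-a258-143f-1c5c-319f2db1a1f1"}}
  ]
}`

func main() {
	counts, err := countByGPU([]byte(sample), "hw_slowdown")
	if err != nil {
		panic(err)
	}
	for uuid, n := range counts {
		fmt.Printf("%s: %d hw_slowdown event(s)\n", uuid, n)
	}
}
```

With the two events above, both a minute apart and from the same GPU, the map holds a single UUID with a count of 2.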
---------
Signed-off-by: Gyuho Lee <[email protected]>
Co-authored-by: Joseph <[email protected]>