
feature request: allow suppressing "The device error log contains records of errors" log warnings for errors that happened a long time in the past #281

Open
@zackelan


I have an SSD (full JSON output attached below) that has errors recorded in its log:

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 18 (device log contains only the most recent 4 errors)
...
Error 18 occurred at disk power-on lifetime: 1185 hours (49 days + 9 hours)
Error 17 occurred at disk power-on lifetime: 1185 hours (49 days + 9 hours)
Error 16 occurred at disk power-on lifetime: 1185 hours (49 days + 9 hours)
Error 15 occurred at disk power-on lifetime: 1185 hours (49 days + 9 hours)
Error 14 occurred at disk power-on lifetime: 1185 hours (49 days + 9 hours)

which causes smartctl_exporter to log a warning message every time it polls the drive:

time=2025-03-23T19:04:38.630Z level=WARN source=readjson.go:71 msg="S.M.A.R.T. output reading" err="exit status 64" device="/dev/sda;auto (sda)"
time=2025-03-23T19:04:38.630Z level=WARN source=readjson.go:151 msg="The device error log contains records of errors" device="/dev/sda;auto (sda)"

however, my drive's Power_On_Hours is currently over 31,000, while the errors in the device log were recorded at 1185 hours, which is over 3 years of power-on time ago. the drive has passed numerous scheduled self-tests since then, so whatever caused the errors seems to have been transient rather than a sign that the drive is about to die.

these warnings are harmless, but they're also unnecessary log noise that I'd like the option to suppress.

some options I can think of:

  • something along the lines of --ignore-device-log-errors-older-than <duration> or --ignore-device-log-errors-from <device serial number>

  • log this message only once per device, and then suppress the message for that device (at least until smartctl_exporter is restarted)

  • log this message only when the smartctl_device_error_log_count metric increases for a given device (this matches the actual monitoring rule I have in place, alerting on increase(smartctl_device_error_log_count))
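a minimal sketch of the third option, to make the idea concrete. everything here (`errorLogGate`, `ShouldWarn`) is a hypothetical name, not existing smartctl_exporter code, and it assumes the exporter already has the device name and error count in hand at the point where it logs the warning.

```go
package main

import (
	"fmt"
	"sync"
)

// errorLogGate remembers the last error-log count seen for each device.
// Hypothetical helper, not part of smartctl_exporter.
type errorLogGate struct {
	mu   sync.Mutex
	seen map[string]int64
}

func newErrorLogGate() *errorLogGate {
	return &errorLogGate{seen: make(map[string]int64)}
}

// ShouldWarn reports whether a warning should be logged for this poll:
// true the first time a device shows a non-zero count, and afterwards
// only when the count is higher than on the previous poll.
func (g *errorLogGate) ShouldWarn(device string, count int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()

	prev, known := g.seen[device]
	g.seen[device] = count
	if !known {
		return count > 0
	}
	return count > prev
}

func main() {
	gate := newErrorLogGate()
	// Simulate four polls of the same device.
	for _, count := range []int64{18, 18, 18, 19} {
		fmt.Println(count, gate.ShouldWarn("sda", count))
	}
}
```

this warns on the first poll (count 18), stays quiet while the count is unchanged, and warns again when it reaches 19. because the state is in memory only, it also behaves like the second option: the warning comes back once after every restart.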

sat-Marvell_based_SanDisk_SSDs-SanDisk_SD5SG2128G1052E-sda.json
