Device open failed, device did not return an IDENTIFY DEVICE structure, #91
Description
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:122 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="smartctl returned bad data for device /dev/sdb"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"
/usr/local/bin/smartctl_exporter --version
smartctl_exporter, version 0.9.0 (branch: HEAD, revision: 0f32489b4018a21747109a33d7297c1ed85e10ab)
build user: root@f07a6d7b35c8
build date: 20221020-16:19:31
go version: go1.18.7
platform: linux/amd64
We are constantly seeing NVMe drives fail under heavy load.
When that happens, dmesg usually shows something like the output further below.
However, smartctl_exporter doesn't seem to pick up any of this (it could be the smartctl tool itself, too).
Shouldn't it at least report some kind of error if it can't scan the drive?
(Are the metrics not reset to 0 when the exporter can no longer scan the drive?)
[Wed Oct 26 06:18:54 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:18:59 2022] nvme nvme0: I/O 718 QID 11 timeout, aborting
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 529 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 530 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 544 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 545 QID 34 timeout, aborting
...
[Wed Oct 26 06:20:17 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] blk_update_request: I/O error, dev nvme0n1, sector 1875858760 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): log I/O error -5
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): xfs_do_force_shutdown(0x2) called from line 1250 of file fs/xfs/xfs_log.c. Return address = 00000000dbc93c6d
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Log I/O Error Detected. Shutting down filesystem
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s)
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
curl localhost:9633/metrics -s | grep crit | grep -v "#"
critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
curl localhost:9633/metrics -s | grep err | grep -v "#"
media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_num_err_log_entries{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 113
smartctl_device_num_err_log_entries{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 209
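On the question of reporting an error when a drive can't be scanned: the "exit status 2" in the log above is a bitmask documented in the smartctl(8) man page, and bit 1 is exactly the "Device open failed ..." condition. The following is only a rough sketch of how that status could be turned into a per-device success metric. The metric name `smartctl_device_scan_success` and the helper names are hypothetical and are not part of smartctl_exporter. Treating only bits 0 and 1 as "scan failed" is also an assumption.

```go
package main

import "fmt"

// exitBits describes each bit of smartctl's exit status, paraphrased from
// the EXIT STATUS section of the smartctl(8) man page. Index = bit number.
var exitBits = []string{
	"command line did not parse",
	"device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode",
	"a SMART or other ATA command failed, or a SMART data structure had a checksum error",
	"SMART status check returned DISK FAILING",
	"prefail attributes found at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExitStatus returns the description of every bit set in status.
func decodeExitStatus(status int) []string {
	var set []string
	for bit, desc := range exitBits {
		if status&(1<<bit) != 0 {
			set = append(set, desc)
		}
	}
	return set
}

// scanSucceeded reports whether smartctl could talk to the device at all.
// Assumption: only bits 0 and 1 mean "no usable data"; the higher bits
// describe device health, so data was still read.
func scanSucceeded(status int) bool {
	return status&0x3 == 0
}

// scanLine renders a hypothetical gauge in Prometheus text format:
// 1 if the last scan worked, 0 otherwise.
func scanLine(device string, status int) string {
	value := 0
	if scanSucceeded(status) {
		value = 1
	}
	return fmt.Sprintf("smartctl_device_scan_success{device=%q} %d", device, value)
}

func main() {
	// Exit status 2 is what the exporter logged for /dev/sdb.
	for _, reason := range decodeExitStatus(2) {
		fmt.Println("smartctl:", reason)
	}
	fmt.Println(scanLine("/dev/sdb", 2))
}
```

With a metric like this, an alert could fire on the value 0 instead of relying on stale values from the last successful scan.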
Also, two small nitpicks:
- typo in the log message: "unavialable" should be "unavailable"
- could there be a metric that reports the smartctl_exporter version?
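For the version metric, many exporters expose a constant gauge with value 1 that carries the build details as labels. Below is a minimal sketch of what such a line could look like in the Prometheus text format. The metric name `smartctl_exporter_build_info` and the helper `buildInfoLine` are assumptions for illustration, not existing exporter code. The values are taken from the `--version` output above.

```go
package main

import "fmt"

// buildInfoLine renders a constant gauge (always 1) whose labels carry
// the build details, in Prometheus text exposition format.
// The "<name>_build_info" naming is an assumed convention.
func buildInfoLine(name, version, revision, goVersion string) string {
	return fmt.Sprintf(
		"%s_build_info{version=%q,revision=%q,goversion=%q} 1",
		name, version, revision, goVersion,
	)
}

func main() {
	// Values copied from the --version output in this report.
	fmt.Println(buildInfoLine(
		"smartctl_exporter",
		"0.9.0",
		"0f32489b4018a21747109a33d7297c1ed85e10ab",
		"go1.18.7",
	))
}
```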