Problem
On NVIDIA B100+ systems, some ConnectX-7 NICs are NVSwitch/Fabric Manager-managed Limited PFs (VPD marked SMDL=SW_MNG). These NICs' hardware registers are owned by Fabric Manager, not the host. Reading any file under /sys/class/infiniband/<dev>/ports/*/counters/ triggers kernel ACCESS_REG(0x805) firmware errors:
mlx5_core 0000:xx:00.0: mlx5_cmd_out_err:834:(pid 8567): ACCESS_REG(0x805) op_mod(0x1) failed,status bad operation(0x2), syndrome (0x9a6171), err(-22)
Each scrape produces 9 such lines per device, flooding dmesg. See prometheus/node_exporter#3434 for full context.
Why InfiniBandClass() can't be used
InfiniBandClass() eagerly parses all devices in a single call (parseInfiniBandDevice → parseInfiniBandPort → parseInfiniBandCounters). The caller has no opportunity to filter devices before the counter files are read: by the time it iterates over the returned devices, the firmware errors have already been triggered.
The caller can identify FM-managed devices beforehand (by reading /sys/class/infiniband/<dev>/device/subsystem_device, which is world-readable), but cannot skip them because parseInfiniBandDevice is unexported.
Request
Either:
Option A: Export parseInfiniBandDevice (rename to InfiniBandDevice):
func (fs FS) InfiniBandDevice(name string) (*InfiniBandDevice, error)
This lets callers list the directory, filter, and parse per-device.
Option B: Add filtering to InfiniBandClass:
func (fs FS) InfiniBandClass(excludePattern ...string) (InfiniBandClass, error)
Option A is preferable because it is purely additive: InfiniBandClass() is unchanged, so all existing callers are unaffected, and new users gain per-device parsing without any migration cost.
Workaround in node_exporter
We are currently using go:linkname to call the unexported parseInfiniBandDevice as a temporary bridge, which is fragile and not sustainable.