Export InfiniBandDevice parsing or add device filtering to InfiniBandClass #823

@gugulee

Problem

On NVIDIA B100+ systems, some ConnectX-7 NICs are NVSwitch/Fabric Manager-managed Limited PFs (VPD marked SMDL=SW_MNG). These NICs' hardware registers are owned by Fabric Manager, not the host. Reading any file under /sys/class/infiniband/<dev>/ports/*/counters/ triggers kernel ACCESS_REG(0x805) firmware errors:

mlx5_core 0000:xx:00.0: mlx5_cmd_out_err:834:(pid 8567): ACCESS_REG(0x805) op_mod(0x1) failed,status bad operation(0x2), syndrome (0x9a6171), err(-22)

This produces 9 log lines per device per scrape, flooding dmesg. See node_exporter#3434 (prometheus/node_exporter#3434) for full context.

Why InfiniBandClass() can't be used

InfiniBandClass() eagerly parses all devices — parseInfiniBandDevice → parseInfiniBandPort → parseInfiniBandCounters — in a single call. The caller has no opportunity to filter devices before counter files are read. By the time the caller iterates over the returned devices, the firmware errors have already been triggered.

The caller can identify FM-managed devices beforehand (by reading /sys/class/infiniband/<dev>/device/subsystem_device, which is world-readable), but cannot skip them because parseInfiniBandDevice is unexported.

Request

Either:

Option A: Export parseInfiniBandDevice (rename to InfiniBandDevice):

func (fs FS) InfiniBandDevice(name string) (*InfiniBandDevice, error)

This lets callers list the directory, filter, and parse per-device.

Option B: Add filtering to InfiniBandClass:

func (fs FS) InfiniBandClass(excludePattern ...string) (InfiniBandClass, error)

Option A is preferable because it is purely additive: InfiniBandClass() is unchanged, so all existing callers are unaffected. New users gain per-device parsing without any migration cost.

Workaround in node_exporter

We are currently using go:linkname to call the unexported parseInfiniBandDevice as a temporary bridge, which is fragile and not sustainable.
