Skip to content

Use ip commands to bypass netlink PAGE_SIZE limit for InfiniBand devices#1026

Open
bbenshab wants to merge 6 commits intok8snetworkplumbingwg:masterfrom
bbenshab:fix/infiniband-netlink-buffer-size
Open

Use ip commands to bypass netlink PAGE_SIZE limit for InfiniBand devices#1026
bbenshab wants to merge 6 commits intok8snetworkplumbingwg:masterfrom
bbenshab:fix/infiniband-netlink-buffer-size

Conversation

@bbenshab
Copy link

@bbenshab bbenshab commented Feb 7, 2026

Problem

When configuring InfiniBand devices with 16 VFs, the SR-IOV Network Operator fails with "message too long" errors:

ERROR configSriovVFDevices(): unable to get PF link for device: message too long

Root Cause

This is caused by a kernel regression (see OCPBUGS-74889 and RHEL-145522) where the Linux kernel limits individual netlink messages to PAGE_SIZE (~4KB on x86_64).

InfiniBand devices with 16 VFs return metadata (MAC, VLAN, Node GUID, Port GUID, RDMA stats per VF) that exceeds this 4KB limit. The kernel enforces this limit at the message level, making it impossible to query link information via netlink for IB devices with many VFs.

Note: While this is a kernel regression that should ultimately be fixed in the kernel, we cannot wait for a kernel fix as we need to support InfiniBand devices now. This workaround provides a working solution that is isolated to IB device handling.

Solution

Skip netlink queries entirely for InfiniBand devices and use direct ip link set commands to configure VF GUIDs:

  1. configSriovVFDevices(): Skip LinkByName() for IB devices, set pfLink=nil
  2. VF GUID configuration: Call configureVfGUIDWithIPCommand() when pfLink=nil
  3. New function: configureVfGUIDWithIPCommand() generates random GUIDs and uses ip link set commands directly
  4. Set PF link up: Use sysfs + ip link set for IB devices, netlink for non-IB devices

Why this works:

  • ip link set commands use netlink SETLINK operations that don't query VF data
  • Only configuration operations are performed, no large responses returned
  • Reading link state from sysfs avoids netlink queries
  • Kernel accepts configuration commands without hitting PAGE_SIZE limit

Changes

+ Added imports: os/exec, crypto/rand
+ Modified configSriovVFDevices(): Skip netlink for IB, set pfLink=nil
+ Modified VF GUID config: Use ip commands when pfLink=nil  
+ New function configureVfGUIDWithIPCommand(): Direct ip command execution
+ Modified "Set PF link up": Use sysfs/ip for IB, netlink for others

Testing

Environment:

  • Hardware: NVIDIA ConnectX-7 InfiniBand adapters
  • Configuration: 12 NICs per node, 16 VFs per NIC
  • Platform: OpenShift 4.20, RHCOS 9.6
  • Kernel: 5.14.0-570.73.1.el9_6.x86_64

References

Note on Kernel Regression

While this issue originates from a kernel regression, we cannot wait for a kernel fix because:

  1. InfiniBand support is required now for production deployments
  2. Kernel fixes take months to propagate through releases
  3. This workaround is clean, isolated, and maintainable
  4. It will continue to work even after kernel is fixed

This PR provides a working solution that unblocks InfiniBand deployments while the kernel team addresses the underlying regression.

@github-actions
Copy link

github-actions bot commented Feb 7, 2026

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@gemini-code-assist
Copy link

Summary of Changes

Hello @bbenshab, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where InfiniBand devices with multiple Virtual Functions (VFs) fail to configure due to a kernel regression limiting netlink message size. The solution implements a robust workaround by completely bypassing netlink for InfiniBand devices, opting instead for direct 'ip' command execution and sysfs interactions to manage VF GUIDs and PF link states. This ensures InfiniBand device functionality without waiting for a kernel fix.

Highlights

  • InfiniBand Netlink Bypass: InfiniBand device configuration now bypasses standard netlink queries to circumvent a kernel PAGE_SIZE limit, which previously caused "message too long" errors for devices with many VFs.
  • Direct 'ip' Command Usage: Virtual Function (VF) GUIDs for InfiniBand devices are now configured using direct 'ip link set' commands instead of netlink, specifically through a new helper function.
  • PF Link State Management: The process for setting the Physical Function (PF) link to an 'up' state has been updated for InfiniBand devices to use sysfs for status checks and 'ip link set' for state changes, while non-InfiniBand devices continue to use netlink.
  • New Helper Function: A new function, 'configureVfGUIDWithIPCommand', has been introduced to generate random GUIDs and apply them to VFs using 'ip link set' commands.
Changelog
  • pkg/host/internal/sriov/sriov.go
    • Added 'os/exec' and 'crypto/rand' imports for executing external commands and generating random GUIDs.
    • Modified 'configSriovVFDevices' to introduce conditional logic that skips netlink 'LinkByName' calls for InfiniBand devices, setting 'pfLink' to 'nil' in such cases.
    • Updated the VF GUID configuration logic within 'configSriovVFDevices' to invoke the new 'configureVfGUIDWithIPCommand' function when 'pfLink' is 'nil' (i.e., for InfiniBand devices).
    • Revised 'configSriovDevice' to handle setting the PF link up: for InfiniBand devices, it now reads link state from '/sys/class/net/<iface.Name>/operstate' and uses 'ip link set' if the link is not 'up'; for other device types, it continues to use netlink.
    • Introduced 'configureVfGUIDWithIPCommand', a new private method responsible for generating an 8-byte random GUID and applying it as both 'node_guid' and 'port_guid' to a specified VF using 'ip link set' commands.
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a kernel regression for InfiniBand devices by bypassing problematic netlink queries and using ip commands instead. The changes are well-contained and clearly explained. I have a couple of suggestions to enhance code robustness and maintainability by using standard library functions for path and data manipulation, and by refactoring a new function to reduce duplication.

// For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit
if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
// Check link state via sysfs
adminStatePath := fmt.Sprintf("/sys/class/net/%s/operstate", iface.Name)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using filepath.Join is a safer way to construct file paths, as it handles path separators correctly. While fmt.Sprintf is likely safe here for a sysfs path, filepath.Join is more robust and considered a best practice for path construction.

Suggested change
adminStatePath := fmt.Sprintf("/sys/class/net/%s/operstate", iface.Name)
adminStatePath := filepath.Join("/sys/class/net", iface.Name, "operstate")

Comment on lines +1298 to +1321
// Generate random GUID (8 bytes for InfiniBand)
guid := make([]byte, 8)
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %v", err)
}

// Format GUID as hex string with colons
guidStr := fmt.Sprintf("%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x",
guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], guid[6], guid[7])

log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

// Set node GUID using ip command
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "node_guid", guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set node_guid: %v, output: %s", err, string(output))
}

// Set port GUID using ip command
cmd = exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "port_guid", guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set port_guid: %v, output: %s", err, string(output))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function can be simplified and made more robust.

  1. The GUID formatting can be done more idiomatically using net.HardwareAddr(guid).String(). This requires importing the net package.
  2. The two ip link set commands for node_guid and port_guid are very similar and can be consolidated into a loop to reduce code duplication and improve maintainability.
Suggested change
// Generate random GUID (8 bytes for InfiniBand)
guid := make([]byte, 8)
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %v", err)
}
// Format GUID as hex string with colons
guidStr := fmt.Sprintf("%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x",
guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], guid[6], guid[7])
log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)
// Set node GUID using ip command
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "node_guid", guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set node_guid: %v, output: %s", err, string(output))
}
// Set port GUID using ip command
cmd = exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "port_guid", guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set port_guid: %v, output: %s", err, string(output))
}
// Generate random GUID (8 bytes for InfiniBand)
guid := make([]byte, 8)
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %v", err)
}
// Format GUID as hex string with colons. Note: requires importing "net".
guidStr := net.HardwareAddr(guid).String()
log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)
for _, guidType := range []string{"node_guid", "port_guid"} {
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output))
}
}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with many VFs. The approach is to bypass netlink queries for IB devices and use ip link set commands directly for configuration, which is a sensible solution to unblock users.

The implementation is solid, but I've identified a few areas for improvement:

  • A regression in functionality where the GUID pool for VFs is bypassed in the new code path.
  • Inconsistent file path construction that could cause issues in chrooted environments.
  • Opportunities to improve error handling by using Go's standard error wrapping.

I've left specific comments with suggestions to address these points. Once these are addressed, the PR should be in good shape.

Comment on lines +682 to +696
if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
// Check link state via sysfs
adminStatePath := filepath.Join("/sys/class/net", iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state: %v", err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output))
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block has a few areas for improvement:

  1. File Path Construction: The path to operstate is constructed with a hardcoded "/sys/class/net" and without vars.FilesystemRoot. This is inconsistent with other parts of the codebase and could lead to issues in chrooted or testing environments. It should use filepath.Join(vars.FilesystemRoot, consts.SysClassNet, ...). The constant consts.SysClassNet is available.
  2. Error Wrapping: The errors from os.ReadFile and cmd.CombinedOutput are not wrapped using %w. Wrapping errors provides more context for debugging.

Here's a suggested implementation that addresses these points:

if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
	// Check link state via sysfs
	adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
	data, err := os.ReadFile(adminStatePath)
	if err != nil {
		return fmt.Errorf("failed to read link state for %s: %w", iface.Name, err)
	}
	state := strings.TrimSpace(string(data))
	// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
	if state != "up" && state != "unknown" {
		cmd := exec.Command("ip", "link", "set", iface.Name, "up")
		if output, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("failed to set link %s up: %w, output: %s", iface.Name, err, string(output))
		}
	}
}

Comment on lines +1295 to +1320
func (s *sriov) configureVfGUIDWithIPCommand(vfAddr string, pfAddr string, pfName string, vfID int) error {
log.Log.V(2).Info("configureVfGUIDWithIPCommand(): configure vf guid using ip command",
"vfAddr", vfAddr, "pfAddr", pfAddr, "pfName", pfName, "vfID", vfID)

// Generate random GUID (8 bytes for InfiniBand)
guid := make([]byte, 8)
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %v", err)
}

// Format GUID as hex string with colons using net.HardwareAddr
guidStr := net.HardwareAddr(guid).String()

log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

// Set both node_guid and port_guid using ip command
for _, guidType := range []string{"node_guid", "port_guid"} {
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output))
}
}

return nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This function always generates a random GUID, which bypasses the GUID pool logic present in the netlink-based configuration path (s.infinibandHelper.ConfigureVfGUID). If a user has configured a GUID pool via InfinibandGUIDConfigFilePath, it will be ignored for InfiniBand devices when this workaround is active. This is a regression in functionality.

To address this, the GUID acquisition logic (checking the pool or generating a random one) should be centralized. A potential solution is to move this function's logic into the infinibandHelper and have it access the guidPool before falling back to random generation. This would ensure consistent behavior across both configuration paths.

Comment on lines +1301 to +1317
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %v", err)
}

// Format GUID as hex string with colons using net.HardwareAddr
guidStr := net.HardwareAddr(guid).String()

log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

// Set both node_guid and port_guid using ip command
for _, guidType := range []string{"node_guid", "port_guid"} {
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output))
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The errors returned from rand.Read and cmd.CombinedOutput are not being wrapped. Using %w with fmt.Errorf is a Go best practice that preserves the original error, which is valuable for debugging and error handling further up the call stack.

	_, err := rand.Read(guid)
	if err != nil {
		return fmt.Errorf("failed to generate random GUID: %w", err)
	}

	// Format GUID as hex string with colons using net.HardwareAddr
	guidStr := net.HardwareAddr(guid).String()

	log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

	// Set both node_guid and port_guid using ip command
	for _, guidType := range []string{"node_guid", "port_guid"} {
		cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
		if output, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("failed to set %s: %w, output: %s", guidType, err, string(output))
		}
	}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by using ip commands instead of netlink to avoid PAGE_SIZE limitations. The changes are well-structured and properly isolated to InfiniBand device handling. I have a couple of suggestions to improve logging and error handling for better maintainability and debugging.

Comment on lines 45 to 48
if err != nil {
log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err)
return err
log.Log.Info("GetVfGUID(): failed to get GUID from IB GUID pool, using random GUID", "pfAddr", pfAddr, "vfID", vfID, "error", err)
return nil, err
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message at line 46 is misleading. It states "using random GUID", but the function returns an error instead of falling back to the randomly generated GUID. This can make debugging difficult. The log level should also be Error since an error is being handled and returned, which is a better practice for error logging.

Suggested change
if err != nil {
log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err)
return err
log.Log.Info("GetVfGUID(): failed to get GUID from IB GUID pool, using random GUID", "pfAddr", pfAddr, "vfID", vfID, "error", err)
return nil, err
}
if err != nil {
log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID)
return nil, err
}

if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with other error handling in this file (e.g., line 1312) and to follow Go best practices for error wrapping, please use %w instead of %v. This allows the error to be unwrapped using errors.Is or errors.As, which can be useful for programmatic error handling.

Suggested change
return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output))
return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression affecting InfiniBand devices by using ip commands instead of netlink queries to avoid message size limits. The changes are well-structured and correctly isolate the logic for InfiniBand devices. However, I've identified a critical issue in the random GUID generation logic. It uses math/rand without proper seeding, which will lead to predictable GUIDs and likely cause collisions in a distributed environment. I've provided a detailed comment and a code suggestion to address this by using the cryptographically secure crypto/rand package.

Comment on lines +40 to +53
func (i *infiniband) GetVfGUID(pfAddr string, vfID int) (net.HardwareAddr, error) {
guid := generateRandomGUID()

if i.guidPool != nil {
guidFromPool, err := i.guidPool.GetVFGUID(pfAddr, vfID)
if err != nil {
log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err)
return err
log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID)
return nil, err
}
guid = guidFromPool
}

return guid, nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The current implementation for random GUID generation relies on generateRandomGUID(), which uses math/rand. This is problematic for two main reasons:

  1. Predictable Sequences: If math/rand is not seeded, it will produce the same sequence of "random" numbers on every program start, leading to identical GUIDs being generated across different nodes and causing network collisions.
  2. Collision Risk: Even if seeded (e.g., with the current time), math/rand is not cryptographically secure and is not suitable for generating unique identifiers where collision avoidance is critical.

For generating unique identifiers like GUIDs, crypto/rand should be used as it provides cryptographically secure random numbers.

I suggest refactoring this function to use crypto/rand and to only generate a random GUID when a GUID cannot be retrieved from the pool. This also avoids unnecessary work.

Note: The suggestion requires importing crypto/rand and fmt.

func (i *infiniband) GetVfGUID(pfAddr string, vfID int) (net.HardwareAddr, error) {
	if i.guidPool != nil {
		guidFromPool, err := i.guidPool.GetVFGUID(pfAddr, vfID)
		if err != nil {
			log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID)
			return nil, err
		}
		return guidFromPool, nil
	}

	// Fallback to random GUID generation if pool is not available.
	// Using crypto/rand for cryptographically secure random numbers to avoid collisions.
	guid := make(net.HardwareAddr, 8)
	if _, err := rand.Read(guid); err != nil {
		return nil, fmt.Errorf("failed to generate random GUID: %w", err)
	}
	// Set U/L bit to indicate locally administered address.
	guid[0] |= 0x02
	// Unset I/G bit for unicast.
	guid[0] &^= 0x01
	return guid, nil
}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug that limits netlink message sizes, affecting InfiniBand devices with many VFs. The approach of using ip commands directly instead of netlink for IB devices is a solid strategy. The code changes are well-structured, with new logic properly encapsulated.

My review has identified a potential issue where the workaround might not be applied if the linkType is not explicitly set in the SriovNetworkNodePolicy, as the device type detection relies on this optional field. I've suggested a more robust detection method using sysfs to ensure the workaround is always effective for IB devices. Additionally, I've pointed out a unit test that needs to be updated to correctly validate the new code path for IB configuration.

Overall, this is a good change that addresses a critical issue. Once the suggested improvements are incorporated, it should be ready for merging.

Comment on lines +516 to 527
var pfLink netlinkPkg.Link
if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
log.Log.V(2).Info("configSriovVFDevices(): skipping netlink query for InfiniBand device, will use ip commands",
"device", iface.PciAddress, "linkType", iface.LinkType)
// pfLink remains nil for IB - we'll use ip commands instead
} else {
pfLink, err = s.netlinkLib.LinkByName(iface.Name)
if err != nil {
log.Log.Error(err, "configSriovVFDevices(): unable to get PF link for device", "device", iface)
return err
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This logic relies on iface.LinkType from the user's policy to identify an InfiniBand device. If the user doesn't specify this optional field, the code will fall back to using netlinkLib.LinkByName(), which is the very call this PR aims to avoid for IB devices, potentially re-introducing the "message too long" error.

To make this workaround more robust, I suggest detecting the link type from sysfs (/sys/class/net/<iface>/type) if it's not specified in the policy. This avoids the problematic netlink call while ensuring IB devices are always handled correctly.

                // Determine link type, preferring the policy spec but falling back to sysfs.
                // This is to ensure we apply the IB workaround even if linkType is not set in the policy.
                linkType := iface.LinkType
                if linkType == "" {
                        // Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
                        typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
                        if data, err := os.ReadFile(typePath); err == nil {
                                if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
                                        if typeInt == syscall.ARPHRD_INFINIBAND {
                                                linkType = consts.LinkTypeIB
                                        }
                                }
                        }
                }

		var pfLink netlinkPkg.Link
		var err error
		if strings.EqualFold(linkType, consts.LinkTypeIB) {
			log.Log.V(2).Info("configSriovVFDevices(): skipping netlink query for InfiniBand device, will use ip commands",
				"device", iface.PciAddress, "linkType", linkType)
			// pfLink remains nil for IB - we'll use ip commands instead
		} else {
			pfLink, err = s.netlinkLib.LinkByName(iface.Name)
			if err != nil {
				log.Log.Error(err, "configSriovVFDevices(): unable to get PF link for device", "device", iface)
				return err
			}
		}

Comment on lines +679 to 707
// For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit
if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
// Check link state via sysfs
adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state: %w", err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))
}
}
} else {
// For non-IB devices, use standard netlink path
pfLink, err := s.netlinkLib.LinkByName(iface.Name)
if err != nil {
return err
}
if !s.netlinkLib.IsLinkAdminStateUp(pfLink) {
err = s.netlinkLib.LinkSetUp(pfLink)
if err != nil {
return err
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to my other comment, this logic relies on iface.LinkType from the user's policy. If linkType is not specified for an InfiniBand device, this code will take the else path and use netlink, which could trigger the kernel bug this PR is trying to work around.

To make this more robust, we should determine the link type by checking sysfs if it's not provided in the policy, before deciding which path to take.

	// Set PF link up
	// Determine link type, preferring the policy spec but falling back to sysfs.
	linkType := iface.LinkType
	if linkType == "" {
		// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
		typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
		if data, err := os.ReadFile(typePath); err == nil {
			if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
				if typeInt == syscall.ARPHRD_INFINIBAND {
					linkType = consts.LinkTypeIB
				}
			}
		}
	}

	// For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit
	if strings.EqualFold(linkType, consts.LinkTypeIB) {
		// Check link state via sysfs
		adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
		data, err := os.ReadFile(adminStatePath)
		if err != nil {
			return fmt.Errorf("failed to read link state: %w", err)
		}
		state := strings.TrimSpace(string(data))
		// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
		if state != "up" && state != "unknown" {
			cmd := exec.Command("ip", "link", "set", iface.Name, "up")
			if output, err := cmd.CombinedOutput(); err != nil {
				return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))
			}
		}
	} else {
		// For non-IB devices, use standard netlink path
		pfLink, err := s.netlinkLib.LinkByName(iface.Name)
		if err != nil {
			return err
		}
		if !s.netlinkLib.IsLinkAdminStateUp(pfLink) {
			err = s.netlinkLib.LinkSetUp(pfLink)
			if err != nil {
				return err
			}
		}
	}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that affects InfiniBand devices with a large number of VFs. The changes bypass problematic netlink queries by using ip commands and sysfs reads directly for IB device configuration. The approach is well-isolated and seems correct.

The refactoring of GUID generation into a separate GetVfGUID function using crypto/rand is a good improvement.

I've identified one area of code duplication in pkg/host/internal/sriov/sriov.go where the logic to determine the link type is repeated. I've left a comment suggesting to refactor this into a helper method to improve maintainability.

Overall, this is a solid contribution to address a tricky issue.

Comment on lines +523 to +536
// Determine link type, preferring the policy spec but falling back to sysfs.
// This ensures we apply the IB workaround even if linkType is not set in the policy.
linkType := iface.LinkType
if linkType == "" {
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
if data, err := os.ReadFile(typePath); err == nil {
if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
if typeInt == ARPHRD_INFINIBAND {
linkType = consts.LinkTypeIB
}
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for determining the link type by checking the spec and falling back to sysfs is duplicated in configSriovDevice (lines 702-713). To improve maintainability and reduce code duplication, consider extracting this logic into a new helper method on the sriov struct.

For example, you could add:

// getLinkTypeWithIBFallback determines the link type for an interface.
// It prefers the type from the spec, but falls back to checking sysfs for InfiniBand.
func (s *sriov) getLinkTypeWithIBFallback(iface *sriovnetworkv1.Interface) string {
	linkType := iface.LinkType
	if linkType == "" {
		// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
		typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
		if data, err := os.ReadFile(typePath); err == nil {
			if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
				if typeInt == ARPHRD_INFINIBAND {
					linkType = consts.LinkTypeIB
				}
			}
		}
	}
	return linkType
}

Then, you can replace this block and the one in configSriovDevice with a single call:

linkType := s.getLinkTypeWithIBFallback(iface)

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes cleverly bypass problematic netlink queries for these devices by using ip commands and direct sysfs reads, which is a well-isolated and effective solution. The code has been refactored to accommodate this, for example, by extracting the GUID generation logic into a reusable function. My review includes a suggestion to improve the readability of a new helper function. Overall, the changes are well-implemented and address the problem effectively.

Comment on lines +230 to +244
func (s *sriov) getLinkTypeWithIBFallback(iface *sriovnetworkv1.Interface) string {
linkType := iface.LinkType
if linkType == "" {
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
if data, err := os.ReadFile(typePath); err == nil {
if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
if typeInt == ARPHRD_INFINIBAND {
linkType = consts.LinkTypeIB
}
}
}
}
return linkType
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The nested if statements in this function make it a bit difficult to follow. Refactoring to use early returns would flatten the structure and improve readability and maintainability. Additionally, adding verbose logging for cases where reading or parsing the link type from sysfs fails could be beneficial for future debugging, even if the errors are intentionally ignored in this fallback mechanism.

func (s *sriov) getLinkTypeWithIPFallback(iface *sriovnetworkv1.Interface) string {
	if iface.LinkType != "" {
		return iface.LinkType
	}

	// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
	typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
	data, err := os.ReadFile(typePath)
	if err != nil {
		log.Log.V(2).Info("getLinkTypeWithIBFallback(): failed to read link type from sysfs, ignoring", "path", typePath, "error", err)
		return ""
	}

	typeInt, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		log.Log.V(2).Info("getLinkTypeWithIBFallback(): failed to parse link type from sysfs, ignoring", "path", typePath, "data", string(data), "error", err)
		return ""
	}

	if typeInt == ARPHRD_INFINIBAND {
		return consts.LinkTypeIB
	}

	return ""
}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with a large number of virtual functions. The changes bypass netlink queries for these devices, opting instead for direct ip command execution and sysfs reads to configure VF GUIDs and manage link states.

The approach is well-contained and effectively addresses the issue. The code is clear and the refactoring in infiniband.go improves modularity.

I have one suggestion to improve the robustness of the link state check for InfiniBand devices. Additionally, while manual testing has been performed, I recommend adding unit tests for the new code paths that execute ip commands to ensure long-term maintainability.

Comment on lines +720 to +733
// Check link state via sysfs
adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state: %w", err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation checks the operstate to decide whether to bring the interface up. This differs from the original logic for non-IB devices, which checks the administrative state (IFF_UP flag). Using operstate can be unreliable as its value can be unknown for some drivers, even if the interface is administratively down. In such cases, the interface would not be brought up, which is a change from the original behavior.

To align with the original logic and make the check more robust, I suggest reading the administrative flags from /sys/class/net/<iface>/flags and checking for the IFF_UP bit.

Suggested change
// Check link state via sysfs
adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state: %w", err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))
}
}
// Check link admin state via sysfs to mirror the netlink behavior more closely.
// The original code checked `IsLinkAdminStateUp`, which corresponds to the IFF_UP flag.
flagsPath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "flags")
data, err := os.ReadFile(flagsPath)
if err != nil {
return fmt.Errorf("failed to read link flags for %s: %w", iface.Name, err)
}
flags, err := strconv.ParseInt(strings.TrimSpace(string(data)), 0, 64)
if err != nil {
return fmt.Errorf("failed to parse link flags for %s ('%s'): %w", iface.Name, string(data), err)
}
// IFF_UP is the first bit (0x1)
isUp := (flags & 0x1) != 0
if !isUp {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up for %s: %w, output: %s", iface.Name, err, string(output))
}
}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes cleverly bypass problematic netlink calls by using ip commands and direct sysfs reads for device configuration, specifically for IB devices. The code is well-structured, with clear separation of concerns, such as the new GetVfGUID function. The changes are isolated to InfiniBand devices, ensuring that other devices continue to function as before. Overall, this is a solid implementation of the described solution. I have one minor suggestion to improve code clarity.

return fmt.Errorf("failed to parse link flags for %s ('%s'): %w", iface.Name, string(data), err)
}
// IFF_UP is the first bit (0x1)
isUp := (flags & 0x1) != 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and to avoid using a magic number, it's recommended to use the syscall.IFF_UP constant. The syscall package is already imported in this file.

Suggested change
isUp := (flags & 0x1) != 0
isUp := (flags & syscall.IFF_UP) != 0

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by avoiding netlink calls and using ip commands directly. The changes are well-structured, introducing specific logic for IB devices while keeping the existing paths for other device types. The refactoring of GUID generation is also a good improvement. I've found one critical issue where a variable is shadowed, which accidentally re-introduces the netlink call this PR aims to remove. I've also included a suggestion to improve performance. Once the critical issue is addressed, the changes look solid.

// exceed this limit. We'll use 'ip link set' commands directly instead.
// See: https://issues.redhat.com/browse/OCPBUGS-74637

linkType := s.getLinkTypeWithIBFallback(iface)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This linkType is correctly determined without using netlink. However, it is being shadowed by another linkType variable declared on line 604 within the loop. That inner linkType is determined using s.GetLinkType(), which makes a netlink call and re-introduces the bug this PR is trying to fix.

Please remove the variable shadowing on lines 604-607 to ensure this linkType is used throughout the function.

log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

// Set both node_guid and port_guid using ip command
for _, guidType := range []string{"node_guid", "port_guid"} {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To avoid re-allocating this string slice on every function call, consider defining it as a package-level constant or variable. For example:

var ibGUIDTypes = []string{"node_guid", "port_guid"}

Then, you can use it here:

for _, guidType := range ibGUIDTypes {

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with many VFs, by bypassing netlink for IB devices and using sysfs for discovery and ip link commands for configuration. A security audit found no significant vulnerabilities, highlighting positive aspects such as secure GUID generation using crypto/rand, safe command execution via os/exec, and robust sysfs path handling. I've identified a couple of areas for improvement regarding code duplication and the use of a magic number, for which specific comments have been left. Overall, this is a solid and effective contribution.

Comment on lines +249 to +253
if typeInt == ARPHRD_INFINIBAND {
return consts.LinkTypeIB
} else if typeInt == 1 { // ARPHRD_ETHER
return consts.LinkTypeETH
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve readability and maintainability, it's better to use a named constant for the magic number 1. Please consider defining ARPHRD_ETHER = 1 (from if_arp.h) alongside ARPHRD_INFINIBAND and using the constant here as suggested.

Suggested change
if typeInt == ARPHRD_INFINIBAND {
return consts.LinkTypeIB
} else if typeInt == 1 { // ARPHRD_ETHER
return consts.LinkTypeETH
}
if typeInt == ARPHRD_INFINIBAND {
return consts.LinkTypeIB
} else if typeInt == ARPHRD_ETHER {
return consts.LinkTypeETH
}

Comment on lines 324 to 362
var iface sriovnetworkv1.InterfaceExt

if isIB {
// For InfiniBand devices with many VFs, avoid netlink query
log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink",
"device", device.Address, "name", pfNetName)
iface = sriovnetworkv1.InterfaceExt{
Name: pfNetName,
PciAddress: device.Address,
Driver: driver,
Vendor: device.Vendor.ID,
DeviceID: device.Product.ID,
Mtu: s.networkHelper.GetNetdevMTU(device.Address),
Mac: s.networkHelper.GetNetDevMac(pfNetName),
LinkType: linkType,
LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName),
LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName),
}
} else {
// For non-IB devices, use netlink as before
link, err := s.netlinkLib.LinkByName(pfNetName)
if err != nil {
log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address)
continue
}

iface := sriovnetworkv1.InterfaceExt{
Name: pfNetName,
PciAddress: device.Address,
Driver: driver,
Vendor: device.Vendor.ID,
DeviceID: device.Product.ID,
Mtu: link.Attrs().MTU,
Mac: link.Attrs().HardwareAddr.String(),
LinkType: s.encapTypeToLinkType(link.Attrs().EncapType),
LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName),
LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName),
iface = sriovnetworkv1.InterfaceExt{
Name: pfNetName,
PciAddress: device.Address,
Driver: driver,
Vendor: device.Vendor.ID,
DeviceID: device.Product.ID,
Mtu: link.Attrs().MTU,
Mac: link.Attrs().HardwareAddr.String(),
LinkType: s.encapTypeToLinkType(link.Attrs().EncapType),
LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName),
LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName),
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is significant code duplication in the if/else block for populating the iface struct. You can refactor this by initializing the common fields first, and then setting the differing fields (Mtu, Mac, LinkType) inside the conditional blocks. This will make the code more concise and easier to maintain.

  iface := sriovnetworkv1.InterfaceExt{
   Name:           pfNetName,
   PciAddress:     device.Address,
   Driver:         driver,
   Vendor:         device.Vendor.ID,
   DeviceID:       device.Product.ID,
   LinkSpeed:      s.networkHelper.GetNetDevLinkSpeed(pfNetName),
   LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName),
  }

  if isIB {
   // For InfiniBand devices with many VFs, avoid netlink query
   log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink",
    "device", device.Address, "name", pfNetName)
   iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address)
   iface.Mac = s.networkHelper.GetNetDevMac(pfNetName)
   iface.LinkType = linkType
  } else {
   // For non-IB devices, use netlink as before
   link, err := s.netlinkLib.LinkByName(pfNetName)
   if err != nil {
    log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address)
    continue
   }

   iface.Mtu = link.Attrs().MTU
   iface.Mac = link.Attrs().HardwareAddr.String()
   iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType)
  }

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that limits netlink message sizes, causing issues with InfiniBand devices that have many VFs. The solution bypasses netlink for IB devices by using ip commands and reading from sysfs directly. However, it introduces potential path traversal vulnerabilities by using unvalidated interface names from the Custom Resource spec to construct file paths on the host. These names should be strictly validated or matched against discovered system devices before being used in file operations. Additionally, there is a lack of unit tests for these new critical code paths; adding tests would significantly increase confidence in this workaround.

Comment on lines +1361 to +1385
func (s *sriov) configureVfGUIDWithIPCommand(vfAddr string, pfAddr string, pfName string, vfID int) error {
log.Log.V(2).Info("configureVfGUIDWithIPCommand(): configure vf guid using ip command",
"vfAddr", vfAddr, "pfAddr", pfAddr, "pfName", pfName, "vfID", vfID)

// Get GUID from pool or generate random (same logic as ConfigureVfGUID)
guid, err := s.infinibandHelper.GetVfGUID(pfAddr, vfID)
if err != nil {
return fmt.Errorf("failed to get VF GUID: %w", err)
}

// Format GUID as hex string with colons using net.HardwareAddr
guidStr := guid.String()

log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)

// Set both node_guid and port_guid using ip command
for _, guidType := range ibGUIDTypes {
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set %s: %w, output: %s", guidType, err, string(output))
}
}

return nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This new function, along with others in this PR, introduces a critical code path using exec.Command to work around the netlink issue. While the implementation looks correct and safe against command injection, it's crucial to have unit tests covering this new logic to ensure its correctness and robustness.
Please add tests for configureVfGUIDWithIPCommand and the other new code paths (like the one in configSriovDevice for setting the link up). You could use mocks for exec.Command and os.ReadFile to verify that:

  • The correct commands and arguments are being constructed.
  • The logic correctly handles both success and failure cases from external commands and file system operations.

Comment on lines +239 to +240
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, ifaceName, "type")
data, err := os.ReadFile(typePath)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The iface.Name field, which originates from the SriovNetworkNodeState Custom Resource, is used to construct file paths in getLinkTypeFromSysfs without proper validation or sanitization. An attacker with permissions to modify the CR can provide a malicious name containing path traversal sequences (e.g., ../../etc/shadow/..) to attempt to read arbitrary files on the host system. While the impact is partially mitigated by the hardcoded suffix (/type), the content of the read file is exposed in logs if parsing fails (line 248), leading to potential information disclosure.

Comment on lines +747 to +748
flagsPath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "flags")
data, err := os.ReadFile(flagsPath)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Similar to the issue in getLinkTypeFromSysfs, the iface.Name field is used here to construct a path to the interface flags file. An attacker could use path traversal to read files named flags in other directories. The content of the file is explicitly included in the error message on line 754 if parsing fails, which can be exposed to users via the Custom Resource status or system logs.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes bypass netlink queries, which can fail due to message size limits, and instead use ip commands and direct sysfs reads for device configuration. The implementation is well-isolated and robust, with good error handling. I've identified one area with some redundant code that could be simplified.

Comment on lines 632 to 640
if pfLink == nil {
if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
return err
}
} else {
if err := s.infinibandHelper.ConfigureVfGUID(addr, iface.PciAddress, vfID, pfLink); err != nil {
return err
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The else block starting at line 636 appears to be unreachable. The pfLink variable is intentionally set to nil for InfiniBand devices at line 585 to bypass netlink queries. The check at line 629 ensures this block is only entered for InfiniBand devices. Therefore, pfLink == nil will always be true, and the else branch will never be executed. This can be simplified by removing the conditional and the unreachable else block.

if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
    return err
}

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a solid workaround for the kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with a large number of VFs. The approach of bypassing netlink calls for IB devices and using ip commands and sysfs reads is well-implemented and cleanly isolated. The refactoring of GUID generation into GetVfGUID and its use of crypto/rand are good improvements. The changes are well-commented and the logic is easy to follow. I have one suggestion to simplify a piece of code by removing an unreachable else block.

Comment on lines 632 to 640
if pfLink == nil {
if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
return err
}
} else {
if err := s.infinibandHelper.ConfigureVfGUID(addr, iface.PciAddress, vfID, pfLink); err != nil {
return err
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The else block starting on line 636 appears to be unreachable. For InfiniBand devices, pfLink is intentionally set to nil before this loop to avoid problematic netlink calls. Consequently, the pfLink == nil check on line 632 will always be true for IB devices, making the else branch dead code. You can simplify this by removing the conditional and directly calling configureVfGUIDWithIPCommand.

                                                if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
                                                        return err
                                                }

@bbenshab
Copy link
Author

bbenshab commented Feb 7, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that limits netlink message sizes, causing failures when configuring InfiniBand devices with many VFs. The approach of bypassing netlink for IB devices and using ip commands and sysfs directly is sound. The changes are well-isolated to InfiniBand device handling.

I've found one critical issue where some helper functions that still use netlink are called for InfiniBand devices, which would defeat the purpose of this workaround. I've left a specific comment with a suggested fix for this.

Otherwise, the code is well-structured and the new functions are clear. The refactoring in infiniband.go to extract GetVfGUID is a good improvement.

Comment on lines +341 to +342
iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address)
iface.Mac = s.networkHelper.GetNetDevMac(pfNetName)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The functions s.networkHelper.GetNetdevMTU() and s.networkHelper.GetNetDevMac() both rely on netlink.LinkByName() internally. This will re-introduce the 'message too long' error for InfiniBand devices with many VFs, as querying the PF link via netlink is what this PR aims to avoid.

To fix this, the MTU and MAC address should be read directly from sysfs for InfiniBand devices, similar to how getLinkTypeFromSysfs is implemented.

Suggested change
iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address)
iface.Mac = s.networkHelper.GetNetDevMac(pfNetName)
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "mtu")); err == nil {
iface.Mtu, _ = strconv.Atoi(strings.TrimSpace(string(data)))
}
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
}

@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that limits netlink message sizes, affecting InfiniBand devices with many VFs. The changes cleverly bypass netlink for IB devices by using direct ip commands and sysfs reads for device discovery and configuration. This is a well-isolated and necessary solution. The code is well-structured, and the changes for IB devices are clearly separated from the existing logic for other devices.

My review includes a few minor suggestions to improve logging consistency for better debuggability and to clean up a duplicated code comment. Overall, this is a solid contribution that addresses a critical issue.

Comment on lines +392 to +394
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with how MTU and admin state failures are handled in this function, consider adding a log message when reading the MAC address from sysfs fails. This will improve debuggability.

Suggested change
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
}
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err)
}

if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
// set PF mtu

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is a duplicate of the one on the preceding line and can be removed to improve code clarity.

Comment on lines +552 to +554
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When getMTUFromSysfs returns an error, it is silently ignored. This could make debugging difficult. Consider adding a log message in an else block to record the error, which would be consistent with other parts of the code.

            if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
                currentMtu = mtu
            } else {
                log.Log.V(2).Info("configSriovPFDevice(): failed to get MTU from sysfs for IB device", "name", ifName, "error", err)
            }

Comment on lines +649 to +651
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When getMTUFromSysfs returns an error, it is silently ignored. This could make debugging difficult. Consider adding a log message in the else block to record the error, similar to how it's done in DiscoverSriovDevices.

            if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
                currentMtu = mtu
            } else {
                log.Log.V(2).Info("checkExternallyManagedPF(): failed to get MTU from sysfs for IB device", "name", ifName, "error", err)
            }

@bbenshab bbenshab force-pushed the fix/infiniband-netlink-buffer-size branch from f4d52c9 to 0e4cd66 Compare February 8, 2026 09:19
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that causes 'message too long' errors when configuring InfiniBand devices with a large number of virtual functions. The solution is to bypass netlink library calls for InfiniBand devices and instead use ip commands for configuration and read from sysfs for device state.

The changes are well-isolated to handle InfiniBand devices differently, while retaining the existing logic for other device types. New helper functions for interacting with sysfs and executing ip commands have been added, which improves modularity. The refactoring in the infiniband package to separate GUID generation is also a good improvement.

Overall, this is a solid implementation of the described workaround. I have a couple of minor suggestions for code cleanup.

log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err)
}
iface.LinkType = linkType
iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This call to s.networkHelper.GetNetDevLinkSpeed(pfNetName) is also present in the else branch on line 415. To avoid code duplication, consider moving this line outside of the if-else block.

InfiniBand devices with 16+ VFs exceed the kernel's PAGE_SIZE limit
(~4KB) when querying via netlink. This causes 'message too long' errors
during discovery and configuration.

Root Cause:
- Kernel netlink responses are capped at PAGE_SIZE regardless of socket
  buffer configuration
- IB devices return extensive metadata (MAC, VLAN, Node GUID, Port GUID,
  RDMA stats) that exceeds this limit with many VFs

Solution:
- Bypass netlink queries for IB devices by using sysfs and ip commands
- Detect IB devices via /sys/class/net/<iface>/type (ARPHRD_INFINIBAND=32)
- Read attributes directly from sysfs (MTU, MAC, link flags, link type)
- Set attributes via 'ip link set' commands to avoid netlink

Changes:
1. Added helper functions for sysfs access:
   - getLinkTypeFromSysfs(): Detect IB vs Ethernet
   - getMTUFromSysfs(): Read MTU from sysfs
   - setMTUViaIPCommand(): Set MTU via ip command
   - getLinkAdminStateFromSysfs(): Read IFF_UP flag from sysfs
   - configureVfGUIDWithIPCommand(): Set VF GUIDs via ip command

2. Fixed DiscoverSriovDevices():
   - Check link type before using netlink
   - Use sysfs for IB device discovery

3. Fixed DiscoverSriovVirtualDevices():
   - Use sysfs for IB VF discovery

4. Fixed configSriovDevice():
   - Set PF link up via ip command for IB devices

5. Fixed configSriovVFDevices():
   - Skip netlink LinkByName for IB PFs
   - Use ip commands for VF GUID configuration

6. Fixed ResetSriovDevice():
   - Use ip command for IB MTU reset

7. Fixed configSriovPFDevice():
   - Check link type before MTU operations
   - Use sysfs/ip commands for IB devices

8. Fixed checkExternallyManagedPF():
   - Use sysfs for IB MTU checks

9. Added constants:
   - ARPHRD_ETHER = 1
   - ARPHRD_INFINIBAND = 32

10. Updated InfinibandInterface:
    - Added GetVfGUID() method for GUID retrieval
    - Use crypto/rand for secure random GUID generation

Fixes: OCPBUGS-74889
Fixes: RHEL-145522
@bbenshab bbenshab force-pushed the fix/infiniband-netlink-buffer-size branch from 0e4cd66 to 3e6eaa9 Compare February 8, 2026 09:34
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression that limits netlink message size, affecting InfiniBand devices with many VFs. The changes bypass netlink queries for IB devices by using direct ip commands and sysfs reads for device discovery and configuration. The overall approach is sound and well-isolated to InfiniBand device handling. The code is clear and the changes are consistent across the affected functions. I've added a few suggestions to improve maintainability by reducing some code duplication. These are minor refactoring opportunities and don't affect the correctness of the implementation.

Comment on lines +382 to 417
if isIB {
// For InfiniBand devices with many VFs, avoid netlink query
log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink",
"device", device.Address, "name", pfNetName)
// Read MTU and MAC directly from sysfs to avoid netlink
if mtu, err := s.getMTUFromSysfs(pfNetName); err == nil {
iface.Mtu = mtu
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MTU from sysfs for IB device", "name", pfNetName, "error", err)
}
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err)
}
iface.LinkType = linkType
iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName)
if adminState, err := s.getLinkAdminStateFromSysfs(pfNetName); err == nil {
iface.LinkAdminState = adminState
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get link admin state from sysfs for IB device", "name", pfNetName, "error", err)
}
} else {
// For non-IB devices, use netlink as before
link, err := s.netlinkLib.LinkByName(pfNetName)
if err != nil {
log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address)
continue
}

iface.Mtu = link.Attrs().MTU
iface.Mac = link.Attrs().HardwareAddr.String()
iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType)
iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName)
iface.LinkAdminState = s.networkHelper.GetNetDevLinkAdminState(pfNetName)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's some code duplication between the if isIB and else blocks. Specifically, iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) is called in both branches. You can move this assignment to before the if block to reduce duplication and improve maintainability.

if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
// set PF mtu

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is a duplicate of the one on the preceding line. Please remove it to improve code clarity.

Comment on lines +544 to 574
if iface.Mtu > 0 {
linkType := s.getLinkTypeWithIBFallback(iface)
var currentMtu int
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName == "" {
log.Log.Error(nil, "configSriovPFDevice(): failed to get interface name", "device", iface.PciAddress)
return fmt.Errorf("failed to get interface name for %s", iface.PciAddress)
}
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}
if iface.Mtu > currentMtu {
if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
} else {
// Use netlink for non-IB devices
currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress)
if iface.Mtu > currentMtu {
err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu)
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for setting the MTU, including the check iface.Mtu > currentMtu and the error handling, is duplicated between the InfiniBand and non-InfiniBand cases. This can be refactored to reduce duplication and improve maintainability. You could determine currentMtu and the appropriate setMtu function within the if/else block, and then call the function and handle the error once after the block.

Comment on lines +645 to +657
linkType := s.getLinkTypeWithIBFallback(iface)
var currentMtu int
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName != "" {
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}
}
} else {
currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to get the current MTU based on the link type is also used in configSriovPFDevice. To avoid code duplication, consider extracting this logic into a new helper function, for example getCurrentMtu(iface *sriovnetworkv1.Interface) int.

@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug that limits netlink message sizes, affecting InfiniBand devices with many VFs. The solution bypasses netlink by using ip commands and reading from sysfs directly, which is a sound approach. The changes are well-contained, and the use of crypto/rand for GUID generation is a good security improvement. My review includes suggestions to improve maintainability by reducing code duplication and to fix a potential bug in the validation logic for externally managed devices to ensure more robust error handling.

Comment on lines +647 to +654
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName != "" {
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In checkExternallyManagedPF, if TryGetInterfaceName returns an empty string for an InfiniBand device, the function proceeds silently, and currentMtu remains 0. This can lead to incorrect validation, as the subsequent MTU check (iface.Mtu > currentMtu) might pass when it should have been validated against the actual device MTU. This could allow an invalid configuration for an externally managed device. It would be safer to return an error if the interface name or MTU cannot be determined, ensuring the check is robust.

Suggested change
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName != "" {
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}
}
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName == "" {
err := fmt.Errorf("checkExternallyManagedPF(): failed to get interface name for IB device %s", iface.PciAddress)
log.Log.Error(err, "")
return err
}
mtu, err := s.getMTUFromSysfs(ifName)
if err != nil {
log.Log.Error(err, "checkExternallyManagedPF(): failed to get MTU from sysfs for IB device", "name", ifName)
return err
}
currentMtu = mtu

if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
// set PF mtu

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line contains a duplicated comment. It can be removed to improve code clarity.

Comment on lines 545 to 573
linkType := s.getLinkTypeWithIBFallback(iface)
var currentMtu int
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Use sysfs for IB to avoid netlink PAGE_SIZE limit
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName == "" {
log.Log.Error(nil, "configSriovPFDevice(): failed to get interface name", "device", iface.PciAddress)
return fmt.Errorf("failed to get interface name for %s", iface.PciAddress)
}
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
}
if iface.Mtu > currentMtu {
if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
} else {
// Use netlink for non-IB devices
currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress)
if iface.Mtu > currentMtu {
err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu)
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to determine the current MTU for an interface, which handles both InfiniBand and other link types, is duplicated between this function (configSriovPFDevice) and checkExternallyManagedPF (lines 645-658). To improve maintainability and avoid potential inconsistencies in the future, consider extracting this logic into a new helper method on the sriov struct, for example: getCurrentMtu(iface *sriovnetworkv1.Interface) (int, error).

- Create getCurrentMtu() helper function that properly handles IB devices
- Use sysfs for IB devices, netlink for others
- Replace duplicated MTU logic in configSriovPFDevice() and checkExternallyManagedPF()
- Ensure proper error handling for IB devices (return error instead of silently using 0)
- Remove duplicate comment "// set PF mtu"

Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel netlink issue affecting InfiniBand devices with numerous VFs, which can cause "message too long" errors. The solution cleverly bypasses netlink for these devices by leveraging direct ip command execution for configuration and reading device state from sysfs. The changes are well-isolated to InfiniBand device handling, ensuring that other devices continue to use the existing netlink-based logic. The implementation is clean and adds several helper functions for sysfs interaction, improving modularity. I've included a few suggestions to enhance code clarity by removing redundant function calls and to address a potential bug related to a missing check.

Comment on lines +574 to +575
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a potential issue here. s.networkHelper.TryGetInterfaceName can return an empty string, but there is no check for it before calling setMTUViaIPCommand. This will cause the ip command to fail. Please add a check for an empty interface name.

Also, iface.Name should already contain the interface name, making the call to TryGetInterfaceName redundant.

Comment on lines +135 to +138
ifName := s.networkHelper.TryGetInterfaceName(ifaceStatus.PciAddress)
if ifName == "" {
return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The interface name is already available in ifaceStatus.Name. Calling TryGetInterfaceName again is redundant. You can use ifaceStatus.Name directly.

Suggested change
ifName := s.networkHelper.TryGetInterfaceName(ifaceStatus.PciAddress)
if ifName == "" {
return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress)
}
ifName := ifaceStatus.Name
if ifName == "" {
return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress)
}

Comment on lines +323 to +326
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName == "" {
return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The interface name should already be available in iface.Name. Calling TryGetInterfaceName again is redundant. You can use iface.Name directly.

Suggested change
ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress)
if ifName == "" {
return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress)
}
ifName := iface.Name
if ifName == "" {
return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress)
}

- Replace TryGetInterfaceName() calls with direct .Name field access in ResetSriovDevice, getCurrentMtu, and configSriovPFDevice
- Add empty name check in configSriovPFDevice before calling setMTUViaIPCommand to prevent ip command failures
- Both Interface and InterfaceExt structs already have Name field populated, making TryGetInterfaceName redundant in these contexts

Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression affecting InfiniBand devices with many VFs, which causes netlink messages to exceed PAGE_SIZE. The solution bypasses netlink for IB devices by using direct ip commands for configuration and reading device information from sysfs. A security audit confirmed no vulnerabilities meeting the required severity threshold were identified, and the implementation follows security best practices, including safe handling of os/exec to prevent command injection. My main feedback is a minor refactoring suggestion to reduce code duplication for better maintainability.

Comment on lines +403 to 438
if isIB {
// For InfiniBand devices with many VFs, avoid netlink query
log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink",
"device", device.Address, "name", pfNetName)
// Read MTU and MAC directly from sysfs to avoid netlink
if mtu, err := s.getMTUFromSysfs(pfNetName); err == nil {
iface.Mtu = mtu
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MTU from sysfs for IB device", "name", pfNetName, "error", err)
}
if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil {
iface.Mac = strings.TrimSpace(string(data))
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err)
}
iface.LinkType = linkType
iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName)
if adminState, err := s.getLinkAdminStateFromSysfs(pfNetName); err == nil {
iface.LinkAdminState = adminState
} else {
log.Log.V(2).Info("DiscoverSriovDevices(): failed to get link admin state from sysfs for IB device", "name", pfNetName, "error", err)
}
} else {
// For non-IB devices, use netlink as before
link, err := s.netlinkLib.LinkByName(pfNetName)
if err != nil {
log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address)
continue
}

iface.Mtu = link.Attrs().MTU
iface.Mac = link.Attrs().HardwareAddr.String()
iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType)
iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName)
iface.LinkAdminState = s.networkHelper.GetNetDevLinkAdminState(pfNetName)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to s.networkHelper.GetNetDevLinkSpeed(pfNetName) is duplicated in both branches of the if isIB statement (lines 419 and 436). To improve maintainability and reduce code duplication, consider moving this call to after the if/else block.

…Fallback call

- Call getLinkTypeWithIBFallback once and reuse the linkType variable
- Inline MTU retrieval logic to avoid redundant function call
- Reduces overhead by avoiding duplicate link type detection

Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel regression affecting InfiniBand devices with many VFs, which causes netlink queries to fail. The solution correctly bypasses netlink for these devices by using ip commands and reading from sysfs directly. The changes are well-structured, introducing new helper functions to encapsulate the new logic. Additionally, the GUID generation has been improved to use crypto/rand for better security. I have one suggestion to refactor a section of the code to improve maintainability by reducing duplication.

Comment on lines +564 to 604
if iface.Mtu > 0 {
linkType := s.getLinkTypeWithIBFallback(iface)
var currentMtu int
if strings.EqualFold(linkType, consts.LinkTypeIB) {
ifName := iface.Name
if ifName == "" {
err := fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress)
log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU")
return err
}
mtu, err := s.getMTUFromSysfs(ifName)
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU from sysfs", "device", iface.PciAddress)
return err
}
currentMtu = mtu
} else {
currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress)
}

if iface.Mtu > currentMtu {
if strings.EqualFold(linkType, consts.LinkTypeIB) {
ifName := iface.Name
if ifName == "" {
err := fmt.Errorf("interface name is empty for device %s", iface.PciAddress)
log.Log.Error(err, "configSriovPFDevice(): fail to get interface name")
return err
}
if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
} else {
err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu)
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code duplicates the logic for getting the current MTU, which is already encapsulated in the getCurrentMtu function introduced in this same pull request. Using the existing helper function will reduce code duplication and improve maintainability.

	if iface.Mtu > 0 {
		currentMtu, err := s.getCurrentMtu(iface)
		if err != nil {
			log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU", "device", iface.PciAddress)
			return err
		}

		if iface.Mtu > currentMtu {
			linkType := s.getLinkTypeWithIBFallback(iface)
			if strings.EqualFold(linkType, consts.LinkTypeIB) {
				if err := s.setMTUViaIPCommand(iface.Name, iface.Mtu); err != nil {
					log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
					return err
				}
			} else {
				if err := s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu); err != nil {
					log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
					return err
				}
			}
		}
	}

- Replace duplicated MTU retrieval logic in configSriovPFDevice with getCurrentMtu call
- Improves maintainability by centralizing MTU logic in one place
- Simplifies error handling by reusing existing helper function

Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
@bbenshab
Copy link
Author

bbenshab commented Feb 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by using ip commands instead of netlink calls to avoid PAGE_SIZE limitations. The changes are well-isolated to IB devices, which are now detected by reading from sysfs. New helper functions are introduced to interact with sysfs and execute ip commands. The GUID generation for IB VFs is also refactored. The changes are logical and well-implemented. I have one suggestion for a minor refactoring to reduce code duplication.

Comment on lines +572 to +583
linkType := s.getLinkTypeWithIBFallback(iface)
if strings.EqualFold(linkType, consts.LinkTypeIB) {
if err := s.setMTUViaIPCommand(iface.Name, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
} else {
if err := s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error handling logic for setting the MTU is duplicated for the InfiniBand and non-InfiniBand cases. This can be refactored to reduce code duplication and improve readability.

			linkType := s.getLinkTypeWithIBFallback(iface)
			var err error
			if strings.EqualFold(linkType, consts.LinkTypeIB) {
				err = s.setMTUViaIPCommand(iface.Name, iface.Mtu)
			} else {
				err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu)
			}
			if err != nil {
				log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
				return err
			}

bbenshab added a commit to bbenshab/sriov-cni that referenced this pull request Feb 9, 2026
InfiniBand devices with 16+ VFs exceed the kernel's PAGE_SIZE limit
(~4KB) when querying via netlink. This causes 'message too long' errors
when the CNI plugin tries to configure VFs.

Root Cause:
- Kernel netlink responses are capped at PAGE_SIZE regardless of socket
  buffer configuration
- IB devices return extensive metadata (MAC, VLAN, Node GUID, Port GUID,
  RDMA stats) that exceeds this limit with many VFs
- The SR-IOV CNI plugin queries the PF link to apply/fill/reset VF config

Solution:
- Detect InfiniBand devices via sysfs /sys/class/net/<iface>/type
- Skip netlink queries for IB PFs when configuring VFs
- Read MTU from sysfs instead of netlink for IB devices
- Skip VF configuration operations that don't apply to IB (VLAN, spoofchk)

Changes:
1. pkg/utils/utils.go:
   - Added ARPHRD_ETHER and ARPHRD_INFINIBAND constants
   - Added IsInfiniBandDevice() function to detect IB via sysfs

2. pkg/sriov/sriov.go:
   - Added os, path/filepath, strconv, strings imports
   - Modified ApplyVFConfig() to skip netlink for IB devices
   - Modified FillOriginalVfInfo() to skip netlink for IB devices
   - Modified ResetVFConfig() to skip netlink for IB devices
   - Read MTU from sysfs for IB devices instead of netlink

This fix complements the sriov-network-operator fix (PR #1026) which
addressed the same issue in the config daemon.

Fixes: k8snetworkplumbingwg/sriov-network-operator#1026
Set VF policy to 'Follow' for InfiniBand devices so VFs inherit
PF link state. This is required for IB VFs to come up when the PF is up.

The policy is set via sysfs (/sys/class/net/<pf>/device/sriov/<vfid>/policy)
to avoid netlink PAGE_SIZE limit issues with IB devices that have many VFs.
@bbenshab bbenshab closed this Feb 9, 2026
@bbenshab
Copy link
Author

bbenshab commented Feb 9, 2026

Closed by mistake - reopening.

@bbenshab bbenshab reopened this Feb 9, 2026
@github-actions
Copy link

github-actions bot commented Feb 9, 2026

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@SchSeba
Copy link
Collaborator

SchSeba commented Mar 10, 2026

Hi @bbenshab let me know if we can close this one, as this is kernel regression I don't think we need to implement w/a in the operator just let the linux kernel to fix the regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants