Use ip commands to bypass netlink PAGE_SIZE limit for InfiniBand devices#1026
Use ip commands to bypass netlink PAGE_SIZE limit for InfiniBand devices#1026bbenshab wants to merge 6 commits intok8snetworkplumbingwg:masterfrom
Conversation
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
Summary of ChangesHello @bbenshab, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where InfiniBand devices with multiple Virtual Functions (VFs) fail to configure due to a kernel regression limiting netlink message size. The solution implements a robust workaround by completely bypassing netlink for InfiniBand devices, opting instead for direct 'ip' command execution and sysfs interactions to manage VF GUIDs and PF link states. This ensures InfiniBand device functionality without waiting for a kernel fix. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a kernel regression for InfiniBand devices by bypassing problematic netlink queries and using ip commands instead. The changes are well-contained and clearly explained. I have a couple of suggestions to enhance code robustness and maintainability by using standard library functions for path and data manipulation, and by refactoring a new function to reduce duplication.
pkg/host/internal/sriov/sriov.go
Outdated
| // For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit | ||
| if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) { | ||
| // Check link state via sysfs | ||
| adminStatePath := fmt.Sprintf("/sys/class/net/%s/operstate", iface.Name) |
There was a problem hiding this comment.
Using filepath.Join is a safer way to construct file paths, as it handles path separators correctly. While fmt.Sprintf is likely safe here for a sysfs path, filepath.Join is more robust and considered a best practice for path construction.
| adminStatePath := fmt.Sprintf("/sys/class/net/%s/operstate", iface.Name) | |
| adminStatePath := filepath.Join("/sys/class/net", iface.Name, "operstate") |
pkg/host/internal/sriov/sriov.go
Outdated
| // Generate random GUID (8 bytes for InfiniBand) | ||
| guid := make([]byte, 8) | ||
| _, err := rand.Read(guid) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to generate random GUID: %v", err) | ||
| } | ||
|
|
||
| // Format GUID as hex string with colons | ||
| guidStr := fmt.Sprintf("%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x", | ||
| guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], guid[6], guid[7]) | ||
|
|
||
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | ||
|
|
||
| // Set node GUID using ip command | ||
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "node_guid", guidStr) | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set node_guid: %v, output: %s", err, string(output)) | ||
| } | ||
|
|
||
| // Set port GUID using ip command | ||
| cmd = exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "port_guid", guidStr) | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set port_guid: %v, output: %s", err, string(output)) | ||
| } |
There was a problem hiding this comment.
This function can be simplified and made more robust.
- The GUID formatting can be done more idiomatically using
net.HardwareAddr(guid).String(). This requires importing thenetpackage. - The two
ip link setcommands fornode_guidandport_guidare very similar and can be consolidated into a loop to reduce code duplication and improve maintainability.
| // Generate random GUID (8 bytes for InfiniBand) | |
| guid := make([]byte, 8) | |
| _, err := rand.Read(guid) | |
| if err != nil { | |
| return fmt.Errorf("failed to generate random GUID: %v", err) | |
| } | |
| // Format GUID as hex string with colons | |
| guidStr := fmt.Sprintf("%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x", | |
| guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], guid[6], guid[7]) | |
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | |
| // Set node GUID using ip command | |
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "node_guid", guidStr) | |
| if output, err := cmd.CombinedOutput(); err != nil { | |
| return fmt.Errorf("failed to set node_guid: %v, output: %s", err, string(output)) | |
| } | |
| // Set port GUID using ip command | |
| cmd = exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), "port_guid", guidStr) | |
| if output, err := cmd.CombinedOutput(); err != nil { | |
| return fmt.Errorf("failed to set port_guid: %v, output: %s", err, string(output)) | |
| } | |
| // Generate random GUID (8 bytes for InfiniBand) | |
| guid := make([]byte, 8) | |
| _, err := rand.Read(guid) | |
| if err != nil { | |
| return fmt.Errorf("failed to generate random GUID: %v", err) | |
| } | |
| // Format GUID as hex string with colons. Note: requires importing "net". | |
| guidStr := net.HardwareAddr(guid).String() | |
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | |
| for _, guidType := range []string{"node_guid", "port_guid"} { | |
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr) | |
| if output, err := cmd.CombinedOutput(); err != nil { | |
| return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output)) | |
| } | |
| } |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with many VFs. The approach is to bypass netlink queries for IB devices and use ip link set commands directly for configuration, which is a sensible solution to unblock users.
The implementation is solid, but I've identified a few areas for improvement:
- A regression in functionality where the GUID pool for VFs is bypassed in the new code path.
- Inconsistent file path construction that could cause issues in chrooted environments.
- Opportunities to improve error handling by using Go's standard error wrapping.
I've left specific comments with suggestions to address these points. Once these are addressed, the PR should be in good shape.
pkg/host/internal/sriov/sriov.go
Outdated
| if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) { | ||
| // Check link state via sysfs | ||
| adminStatePath := filepath.Join("/sys/class/net", iface.Name, "operstate") | ||
| data, err := os.ReadFile(adminStatePath) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to read link state: %v", err) | ||
| } | ||
| state := strings.TrimSpace(string(data)) | ||
| // Set link up if not already up (operstate can be "up", "down", "unknown", etc.) | ||
| if state != "up" && state != "unknown" { | ||
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output)) | ||
| } | ||
| } |
There was a problem hiding this comment.
This block has a few areas for improvement:
- File Path Construction: The path to
operstateis constructed with a hardcoded"/sys/class/net"and withoutvars.FilesystemRoot. This is inconsistent with other parts of the codebase and could lead to issues in chrooted or testing environments. It should usefilepath.Join(vars.FilesystemRoot, consts.SysClassNet, ...). The constantconsts.SysClassNetis available. - Error Wrapping: The errors from
os.ReadFileandcmd.CombinedOutputare not wrapped using%w. Wrapping errors provides more context for debugging.
Here's a suggested implementation that addresses these points:
if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) {
// Check link state via sysfs
adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state for %s: %w", iface.Name, err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link %s up: %w, output: %s", iface.Name, err, string(output))
}
}
}| func (s *sriov) configureVfGUIDWithIPCommand(vfAddr string, pfAddr string, pfName string, vfID int) error { | ||
| log.Log.V(2).Info("configureVfGUIDWithIPCommand(): configure vf guid using ip command", | ||
| "vfAddr", vfAddr, "pfAddr", pfAddr, "pfName", pfName, "vfID", vfID) | ||
|
|
||
| // Generate random GUID (8 bytes for InfiniBand) | ||
| guid := make([]byte, 8) | ||
| _, err := rand.Read(guid) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to generate random GUID: %v", err) | ||
| } | ||
|
|
||
| // Format GUID as hex string with colons using net.HardwareAddr | ||
| guidStr := net.HardwareAddr(guid).String() | ||
|
|
||
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | ||
|
|
||
| // Set both node_guid and port_guid using ip command | ||
| for _, guidType := range []string{"node_guid", "port_guid"} { | ||
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr) | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output)) | ||
| } | ||
| } | ||
|
|
||
| return nil | ||
| } |
There was a problem hiding this comment.
This function always generates a random GUID, which bypasses the GUID pool logic present in the netlink-based configuration path (s.infinibandHelper.ConfigureVfGUID). If a user has configured a GUID pool via InfinibandGUIDConfigFilePath, it will be ignored for InfiniBand devices when this workaround is active. This is a regression in functionality.
To address this, the GUID acquisition logic (checking the pool or generating a random one) should be centralized. A potential solution is to move this function's logic into the infinibandHelper and have it access the guidPool before falling back to random generation. This would ensure consistent behavior across both configuration paths.
pkg/host/internal/sriov/sriov.go
Outdated
| _, err := rand.Read(guid) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to generate random GUID: %v", err) | ||
| } | ||
|
|
||
| // Format GUID as hex string with colons using net.HardwareAddr | ||
| guidStr := net.HardwareAddr(guid).String() | ||
|
|
||
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | ||
|
|
||
| // Set both node_guid and port_guid using ip command | ||
| for _, guidType := range []string{"node_guid", "port_guid"} { | ||
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr) | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set %s: %v, output: %s", guidType, err, string(output)) | ||
| } | ||
| } |
There was a problem hiding this comment.
The errors returned from rand.Read and cmd.CombinedOutput are not being wrapped. Using %w with fmt.Errorf is a Go best practice that preserves the original error, which is valuable for debugging and error handling further up the call stack.
_, err := rand.Read(guid)
if err != nil {
return fmt.Errorf("failed to generate random GUID: %w", err)
}
// Format GUID as hex string with colons using net.HardwareAddr
guidStr := net.HardwareAddr(guid).String()
log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr)
// Set both node_guid and port_guid using ip command
for _, guidType := range []string{"node_guid", "port_guid"} {
cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set %s: %w, output: %s", guidType, err, string(output))
}
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by using ip commands instead of netlink to avoid PAGE_SIZE limitations. The changes are well-structured and properly isolated to InfiniBand device handling. I have a couple of suggestions to improve logging and error handling for better maintainability and debugging.
| if err != nil { | ||
| log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err) | ||
| return err | ||
| log.Log.Info("GetVfGUID(): failed to get GUID from IB GUID pool, using random GUID", "pfAddr", pfAddr, "vfID", vfID, "error", err) | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
The log message at line 46 is misleading. It states "using random GUID", but the function returns an error instead of falling back to the randomly generated GUID. This can make debugging difficult. The log level should also be Error since an error is being handled and returned, which is a better practice for error logging.
| if err != nil { | |
| log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err) | |
| return err | |
| log.Log.Info("GetVfGUID(): failed to get GUID from IB GUID pool, using random GUID", "pfAddr", pfAddr, "vfID", vfID, "error", err) | |
| return nil, err | |
| } | |
| if err != nil { | |
| log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID) | |
| return nil, err | |
| } |
pkg/host/internal/sriov/sriov.go
Outdated
| if state != "up" && state != "unknown" { | ||
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output)) |
There was a problem hiding this comment.
For consistency with other error handling in this file (e.g., line 1312) and to follow Go best practices for error wrapping, please use %w instead of %v. This allows the error to be unwrapped using errors.Is or errors.As, which can be useful for programmatic error handling.
| return fmt.Errorf("failed to set link up: %v, output: %s", err, string(output)) | |
| return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output)) |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression affecting InfiniBand devices by using ip commands instead of netlink queries to avoid message size limits. The changes are well-structured and correctly isolate the logic for InfiniBand devices. However, I've identified a critical issue in the random GUID generation logic. It uses math/rand without proper seeding, which will lead to predictable GUIDs and likely cause collisions in a distributed environment. I've provided a detailed comment and a code suggestion to address this by using the cryptographically secure crypto/rand package.
| func (i *infiniband) GetVfGUID(pfAddr string, vfID int) (net.HardwareAddr, error) { | ||
| guid := generateRandomGUID() | ||
|
|
||
| if i.guidPool != nil { | ||
| guidFromPool, err := i.guidPool.GetVFGUID(pfAddr, vfID) | ||
| if err != nil { | ||
| log.Log.Info("ConfigureVfGUID(): failed to get GUID from IB GUID pool", "address", vfAddr, "error", err) | ||
| return err | ||
| log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID) | ||
| return nil, err | ||
| } | ||
| guid = guidFromPool | ||
| } | ||
|
|
||
| return guid, nil | ||
| } |
There was a problem hiding this comment.
The current implementation for random GUID generation relies on generateRandomGUID(), which uses math/rand. This is problematic for two main reasons:
- Predictable Sequences: If
math/randis not seeded, it will produce the same sequence of "random" numbers on every program start, leading to identical GUIDs being generated across different nodes and causing network collisions. - Collision Risk: Even if seeded (e.g., with the current time),
math/randis not cryptographically secure and is not suitable for generating unique identifiers where collision avoidance is critical.
For generating unique identifiers like GUIDs, crypto/rand should be used as it provides cryptographically secure random numbers.
I suggest refactoring this function to use crypto/rand and to only generate a random GUID when a GUID cannot be retrieved from the pool. This also avoids unnecessary work.
Note: The suggestion requires importing crypto/rand and fmt.
func (i *infiniband) GetVfGUID(pfAddr string, vfID int) (net.HardwareAddr, error) {
if i.guidPool != nil {
guidFromPool, err := i.guidPool.GetVFGUID(pfAddr, vfID)
if err != nil {
log.Log.Error(err, "GetVfGUID(): failed to get GUID from IB GUID pool", "pfAddr", pfAddr, "vfID", vfID)
return nil, err
}
return guidFromPool, nil
}
// Fallback to random GUID generation if pool is not available.
// Using crypto/rand for cryptographically secure random numbers to avoid collisions.
guid := make(net.HardwareAddr, 8)
if _, err := rand.Read(guid); err != nil {
return nil, fmt.Errorf("failed to generate random GUID: %w", err)
}
// Set U/L bit to indicate locally administered address.
guid[0] |= 0x02
// Unset I/G bit for unicast.
guid[0] &^= 0x01
return guid, nil
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug that limits netlink message sizes, affecting InfiniBand devices with many VFs. The approach of using ip commands directly instead of netlink for IB devices is a solid strategy. The code changes are well-structured, with new logic properly encapsulated.
My review has identified a potential issue where the workaround might not be applied if the linkType is not explicitly set in the SriovNetworkNodePolicy, as the device type detection relies on this optional field. I've suggested a more robust detection method using sysfs to ensure the workaround is always effective for IB devices. Additionally, I've pointed out a unit test that needs to be updated to correctly validate the new code path for IB configuration.
Overall, this is a good change that addresses a critical issue. Once the suggested improvements are incorporated, it should be ready for merging.
| var pfLink netlinkPkg.Link | ||
| if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) { | ||
| log.Log.V(2).Info("configSriovVFDevices(): skipping netlink query for InfiniBand device, will use ip commands", | ||
| "device", iface.PciAddress, "linkType", iface.LinkType) | ||
| // pfLink remains nil for IB - we'll use ip commands instead | ||
| } else { | ||
| pfLink, err = s.netlinkLib.LinkByName(iface.Name) | ||
| if err != nil { | ||
| log.Log.Error(err, "configSriovVFDevices(): unable to get PF link for device", "device", iface) | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
This logic relies on iface.LinkType from the user's policy to identify an InfiniBand device. If the user doesn't specify this optional field, the code will fall back to using netlinkLib.LinkByName(), which is the very call this PR aims to avoid for IB devices, potentially re-introducing the "message too long" error.
To make this workaround more robust, I suggest detecting the link type from sysfs (/sys/class/net/<iface>/type) if it's not specified in the policy. This avoids the problematic netlink call while ensuring IB devices are always handled correctly.
// Determine link type, preferring the policy spec but falling back to sysfs.
// This is to ensure we apply the IB workaround even if linkType is not set in the policy.
linkType := iface.LinkType
if linkType == "" {
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
if data, err := os.ReadFile(typePath); err == nil {
if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
if typeInt == syscall.ARPHRD_INFINIBAND {
linkType = consts.LinkTypeIB
}
}
}
}
var pfLink netlinkPkg.Link
var err error
if strings.EqualFold(linkType, consts.LinkTypeIB) {
log.Log.V(2).Info("configSriovVFDevices(): skipping netlink query for InfiniBand device, will use ip commands",
"device", iface.PciAddress, "linkType", linkType)
// pfLink remains nil for IB - we'll use ip commands instead
} else {
pfLink, err = s.netlinkLib.LinkByName(iface.Name)
if err != nil {
log.Log.Error(err, "configSriovVFDevices(): unable to get PF link for device", "device", iface)
return err
}
}| // For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit | ||
| if strings.EqualFold(iface.LinkType, consts.LinkTypeIB) { | ||
| // Check link state via sysfs | ||
| adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate") | ||
| data, err := os.ReadFile(adminStatePath) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to read link state: %w", err) | ||
| } | ||
| state := strings.TrimSpace(string(data)) | ||
| // Set link up if not already up (operstate can be "up", "down", "unknown", etc.) | ||
| if state != "up" && state != "unknown" { | ||
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output)) | ||
| } | ||
| } | ||
| } else { | ||
| // For non-IB devices, use standard netlink path | ||
| pfLink, err := s.netlinkLib.LinkByName(iface.Name) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if !s.netlinkLib.IsLinkAdminStateUp(pfLink) { | ||
| err = s.netlinkLib.LinkSetUp(pfLink) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Similar to my other comment, this logic relies on iface.LinkType from the user's policy. If linkType is not specified for an InfiniBand device, this code will take the else path and use netlink, which could trigger the kernel bug this PR is trying to work around.
To make this more robust, we should determine the link type by checking sysfs if it's not provided in the policy, before deciding which path to take.
// Set PF link up
// Determine link type, preferring the policy spec but falling back to sysfs.
linkType := iface.LinkType
if linkType == "" {
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
if data, err := os.ReadFile(typePath); err == nil {
if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
if typeInt == syscall.ARPHRD_INFINIBAND {
linkType = consts.LinkTypeIB
}
}
}
}
// For InfiniBand devices, use sysfs to avoid netlink PAGE_SIZE limit
if strings.EqualFold(linkType, consts.LinkTypeIB) {
// Check link state via sysfs
adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate")
data, err := os.ReadFile(adminStatePath)
if err != nil {
return fmt.Errorf("failed to read link state: %w", err)
}
state := strings.TrimSpace(string(data))
// Set link up if not already up (operstate can be "up", "down", "unknown", etc.)
if state != "up" && state != "unknown" {
cmd := exec.Command("ip", "link", "set", iface.Name, "up")
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output))
}
}
} else {
// For non-IB devices, use standard netlink path
pfLink, err := s.netlinkLib.LinkByName(iface.Name)
if err != nil {
return err
}
if !s.netlinkLib.IsLinkAdminStateUp(pfLink) {
err = s.netlinkLib.LinkSetUp(pfLink)
if err != nil {
return err
}
}
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that affects InfiniBand devices with a large number of VFs. The changes bypass problematic netlink queries by using ip commands and sysfs reads directly for IB device configuration. The approach is well-isolated and seems correct.
The refactoring of GUID generation into a separate GetVfGUID function using crypto/rand is a good improvement.
I've identified one area of code duplication in pkg/host/internal/sriov/sriov.go where the logic to determine the link type is repeated. I've left a comment suggesting to refactor this into a helper method to improve maintainability.
Overall, this is a solid contribution to address a tricky issue.
pkg/host/internal/sriov/sriov.go
Outdated
| // Determine link type, preferring the policy spec but falling back to sysfs. | ||
| // This ensures we apply the IB workaround even if linkType is not set in the policy. | ||
| linkType := iface.LinkType | ||
| if linkType == "" { | ||
| // Attempt to read from /sys/class/net/<iface>/type to avoid netlink call | ||
| typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type") | ||
| if data, err := os.ReadFile(typePath); err == nil { | ||
| if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil { | ||
| if typeInt == ARPHRD_INFINIBAND { | ||
| linkType = consts.LinkTypeIB | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This logic for determining the link type by checking the spec and falling back to sysfs is duplicated in configSriovDevice (lines 702-713). To improve maintainability and reduce code duplication, consider extracting this logic into a new helper method on the sriov struct.
For example, you could add:
// getLinkTypeWithIBFallback determines the link type for an interface.
// It prefers the type from the spec, but falls back to checking sysfs for InfiniBand.
func (s *sriov) getLinkTypeWithIBFallback(iface *sriovnetworkv1.Interface) string {
linkType := iface.LinkType
if linkType == "" {
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
if data, err := os.ReadFile(typePath); err == nil {
if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil {
if typeInt == ARPHRD_INFINIBAND {
linkType = consts.LinkTypeIB
}
}
}
}
return linkType
}Then, you can replace this block and the one in configSriovDevice with a single call:
linkType := s.getLinkTypeWithIBFallback(iface)
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes cleverly bypass problematic netlink queries for these devices by using ip commands and direct sysfs reads, which is a well-isolated and effective solution. The code has been refactored to accommodate this, for example, by extracting the GUID generation logic into a reusable function. My review includes a suggestion to improve the readability of a new helper function. Overall, the changes are well-implemented and address the problem effectively.
| func (s *sriov) getLinkTypeWithIBFallback(iface *sriovnetworkv1.Interface) string { | ||
| linkType := iface.LinkType | ||
| if linkType == "" { | ||
| // Attempt to read from /sys/class/net/<iface>/type to avoid netlink call | ||
| typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type") | ||
| if data, err := os.ReadFile(typePath); err == nil { | ||
| if typeInt, err := strconv.Atoi(strings.TrimSpace(string(data))); err == nil { | ||
| if typeInt == ARPHRD_INFINIBAND { | ||
| linkType = consts.LinkTypeIB | ||
| } | ||
| } | ||
| } | ||
| } | ||
| return linkType | ||
| } |
There was a problem hiding this comment.
The nested if statements in this function make it a bit difficult to follow. Refactoring to use early returns would flatten the structure and improve readability and maintainability. Additionally, adding verbose logging for cases where reading or parsing the link type from sysfs fails could be beneficial for future debugging, even if the errors are intentionally ignored in this fallback mechanism.
func (s *sriov) getLinkTypeWithIPFallback(iface *sriovnetworkv1.Interface) string {
if iface.LinkType != "" {
return iface.LinkType
}
// Attempt to read from /sys/class/net/<iface>/type to avoid netlink call
typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "type")
data, err := os.ReadFile(typePath)
if err != nil {
log.Log.V(2).Info("getLinkTypeWithIBFallback(): failed to read link type from sysfs, ignoring", "path", typePath, "error", err)
return ""
}
typeInt, err := strconv.Atoi(strings.TrimSpace(string(data)))
if err != nil {
log.Log.V(2).Info("getLinkTypeWithIBFallback(): failed to parse link type from sysfs, ignoring", "path", typePath, "data", string(data), "error", err)
return ""
}
if typeInt == ARPHRD_INFINIBAND {
return consts.LinkTypeIB
}
return ""
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with a large number of virtual functions. The changes bypass netlink queries for these devices, opting instead for direct ip command execution and sysfs reads to configure VF GUIDs and manage link states.
The approach is well-contained and effectively addresses the issue. The code is clear and the refactoring in infiniband.go improves modularity.
I have one suggestion to improve the robustness of the link state check for InfiniBand devices. Additionally, while manual testing has been performed, I recommend adding unit tests for the new code paths that execute ip commands to ensure long-term maintainability.
pkg/host/internal/sriov/sriov.go
Outdated
| // Check link state via sysfs | ||
| adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate") | ||
| data, err := os.ReadFile(adminStatePath) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to read link state: %w", err) | ||
| } | ||
| state := strings.TrimSpace(string(data)) | ||
| // Set link up if not already up (operstate can be "up", "down", "unknown", etc.) | ||
| if state != "up" && state != "unknown" { | ||
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output)) | ||
| } | ||
| } |
There was a problem hiding this comment.
The current implementation checks the operstate to decide whether to bring the interface up. This differs from the original logic for non-IB devices, which checks the administrative state (IFF_UP flag). Using operstate can be unreliable as its value can be unknown for some drivers, even if the interface is administratively down. In such cases, the interface would not be brought up, which is a change from the original behavior.
To align with the original logic and make the check more robust, I suggest reading the administrative flags from /sys/class/net/<iface>/flags and checking for the IFF_UP bit.
| // Check link state via sysfs | |
| adminStatePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "operstate") | |
| data, err := os.ReadFile(adminStatePath) | |
| if err != nil { | |
| return fmt.Errorf("failed to read link state: %w", err) | |
| } | |
| state := strings.TrimSpace(string(data)) | |
| // Set link up if not already up (operstate can be "up", "down", "unknown", etc.) | |
| if state != "up" && state != "unknown" { | |
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | |
| if output, err := cmd.CombinedOutput(); err != nil { | |
| return fmt.Errorf("failed to set link up: %w, output: %s", err, string(output)) | |
| } | |
| } | |
| // Check link admin state via sysfs to mirror the netlink behavior more closely. | |
| // The original code checked `IsLinkAdminStateUp`, which corresponds to the IFF_UP flag. | |
| flagsPath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "flags") | |
| data, err := os.ReadFile(flagsPath) | |
| if err != nil { | |
| return fmt.Errorf("failed to read link flags for %s: %w", iface.Name, err) | |
| } | |
| flags, err := strconv.ParseInt(strings.TrimSpace(string(data)), 0, 64) | |
| if err != nil { | |
| return fmt.Errorf("failed to parse link flags for %s ('%s'): %w", iface.Name, string(data), err) | |
| } | |
| // IFF_UP is the first bit (0x1) | |
| isUp := (flags & 0x1) != 0 | |
| if !isUp { | |
| cmd := exec.Command("ip", "link", "set", iface.Name, "up") | |
| if output, err := cmd.CombinedOutput(); err != nil { | |
| return fmt.Errorf("failed to set link up for %s: %w, output: %s", iface.Name, err, string(output)) | |
| } | |
| } |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes cleverly bypass problematic netlink calls by using ip commands and direct sysfs reads for device configuration, specifically for IB devices. The code is well-structured, with clear separation of concerns, such as the new GetVfGUID function. The changes are isolated to InfiniBand devices, ensuring that other devices continue to function as before. Overall, this is a solid implementation of the described solution. I have one minor suggestion to improve code clarity.
pkg/host/internal/sriov/sriov.go
Outdated
| return fmt.Errorf("failed to parse link flags for %s ('%s'): %w", iface.Name, string(data), err) | ||
| } | ||
| // IFF_UP is the first bit (0x1) | ||
| isUp := (flags & 0x1) != 0 |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by avoiding netlink calls and using ip commands directly. The changes are well-structured, introducing specific logic for IB devices while keeping the existing paths for other device types. The refactoring of GUID generation is also a good improvement. I've found one critical issue where a variable is shadowed, which accidentally re-introduces the netlink call this PR aims to remove. I've also included a suggestion to improve performance. Once the critical issue is addressed, the changes look solid.
| // exceed this limit. We'll use 'ip link set' commands directly instead. | ||
| // See: https://issues.redhat.com/browse/OCPBUGS-74637 | ||
|
|
||
| linkType := s.getLinkTypeWithIBFallback(iface) |
There was a problem hiding this comment.
This linkType is correctly determined without using netlink. However, it is being shadowed by another linkType variable declared on line 604 within the loop. That inner linkType is determined using s.GetLinkType(), which makes a netlink call and re-introduces the bug this PR is trying to fix.
Please remove the variable shadowing on lines 604-607 to ensure this linkType is used throughout the function.
pkg/host/internal/sriov/sriov.go
Outdated
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | ||
|
|
||
| // Set both node_guid and port_guid using ip command | ||
| for _, guidType := range []string{"node_guid", "port_guid"} { |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with many VFs, by bypassing netlink for IB devices and using sysfs for discovery and ip link commands for configuration. A security audit found no significant vulnerabilities, highlighting positive aspects such as secure GUID generation using crypto/rand, safe command execution via os/exec, and robust sysfs path handling. I've identified a couple of areas for improvement regarding code duplication and the use of a magic number, for which specific comments have been left. Overall, this is a solid and effective contribution.
| if typeInt == ARPHRD_INFINIBAND { | ||
| return consts.LinkTypeIB | ||
| } else if typeInt == 1 { // ARPHRD_ETHER | ||
| return consts.LinkTypeETH | ||
| } |
There was a problem hiding this comment.
To improve readability and maintainability, it's better to use a named constant for the magic number 1. Please consider defining ARPHRD_ETHER = 1 (from if_arp.h) alongside ARPHRD_INFINIBAND and using the constant here as suggested.
| if typeInt == ARPHRD_INFINIBAND { | |
| return consts.LinkTypeIB | |
| } else if typeInt == 1 { // ARPHRD_ETHER | |
| return consts.LinkTypeETH | |
| } | |
| if typeInt == ARPHRD_INFINIBAND { | |
| return consts.LinkTypeIB | |
| } else if typeInt == ARPHRD_ETHER { | |
| return consts.LinkTypeETH | |
| } |
pkg/host/internal/sriov/sriov.go
Outdated
| var iface sriovnetworkv1.InterfaceExt | ||
|
|
||
| if isIB { | ||
| // For InfiniBand devices with many VFs, avoid netlink query | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink", | ||
| "device", device.Address, "name", pfNetName) | ||
| iface = sriovnetworkv1.InterfaceExt{ | ||
| Name: pfNetName, | ||
| PciAddress: device.Address, | ||
| Driver: driver, | ||
| Vendor: device.Vendor.ID, | ||
| DeviceID: device.Product.ID, | ||
| Mtu: s.networkHelper.GetNetdevMTU(device.Address), | ||
| Mac: s.networkHelper.GetNetDevMac(pfNetName), | ||
| LinkType: linkType, | ||
| LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName), | ||
| LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName), | ||
| } | ||
| } else { | ||
| // For non-IB devices, use netlink as before | ||
| link, err := s.netlinkLib.LinkByName(pfNetName) | ||
| if err != nil { | ||
| log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address) | ||
| continue | ||
| } | ||
|
|
||
| iface := sriovnetworkv1.InterfaceExt{ | ||
| Name: pfNetName, | ||
| PciAddress: device.Address, | ||
| Driver: driver, | ||
| Vendor: device.Vendor.ID, | ||
| DeviceID: device.Product.ID, | ||
| Mtu: link.Attrs().MTU, | ||
| Mac: link.Attrs().HardwareAddr.String(), | ||
| LinkType: s.encapTypeToLinkType(link.Attrs().EncapType), | ||
| LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName), | ||
| LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName), | ||
| iface = sriovnetworkv1.InterfaceExt{ | ||
| Name: pfNetName, | ||
| PciAddress: device.Address, | ||
| Driver: driver, | ||
| Vendor: device.Vendor.ID, | ||
| DeviceID: device.Product.ID, | ||
| Mtu: link.Attrs().MTU, | ||
| Mac: link.Attrs().HardwareAddr.String(), | ||
| LinkType: s.encapTypeToLinkType(link.Attrs().EncapType), | ||
| LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName), | ||
| LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName), | ||
| } | ||
| } |
There was a problem hiding this comment.
There is significant code duplication in the if/else block for populating the iface struct. You can refactor this by initializing the common fields first, and then setting the differing fields (Mtu, Mac, LinkType) inside the conditional blocks. This will make the code more concise and easier to maintain.
iface := sriovnetworkv1.InterfaceExt{
Name: pfNetName,
PciAddress: device.Address,
Driver: driver,
Vendor: device.Vendor.ID,
DeviceID: device.Product.ID,
LinkSpeed: s.networkHelper.GetNetDevLinkSpeed(pfNetName),
LinkAdminState: s.networkHelper.GetNetDevLinkAdminState(pfNetName),
}
if isIB {
// For InfiniBand devices with many VFs, avoid netlink query
log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink",
"device", device.Address, "name", pfNetName)
iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address)
iface.Mac = s.networkHelper.GetNetDevMac(pfNetName)
iface.LinkType = linkType
} else {
// For non-IB devices, use netlink as before
link, err := s.netlinkLib.LinkByName(pfNetName)
if err != nil {
log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address)
continue
}
iface.Mtu = link.Attrs().MTU
iface.Mac = link.Attrs().HardwareAddr.String()
iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType)
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that limits netlink message sizes, causing issues with InfiniBand devices that have many VFs. The solution bypasses netlink for IB devices by using ip commands and reading from sysfs directly. However, it introduces potential path traversal vulnerabilities by using unvalidated interface names from the Custom Resource spec to construct file paths on the host. These names should be strictly validated or matched against discovered system devices before being used in file operations. Additionally, there is a lack of unit tests for these new critical code paths; adding tests would significantly increase confidence in this workaround.
| func (s *sriov) configureVfGUIDWithIPCommand(vfAddr string, pfAddr string, pfName string, vfID int) error { | ||
| log.Log.V(2).Info("configureVfGUIDWithIPCommand(): configure vf guid using ip command", | ||
| "vfAddr", vfAddr, "pfAddr", pfAddr, "pfName", pfName, "vfID", vfID) | ||
|
|
||
| // Get GUID from pool or generate random (same logic as ConfigureVfGUID) | ||
| guid, err := s.infinibandHelper.GetVfGUID(pfAddr, vfID) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get VF GUID: %w", err) | ||
| } | ||
|
|
||
| // Format GUID as hex string with colons using net.HardwareAddr | ||
| guidStr := guid.String() | ||
|
|
||
| log.Log.Info("configureVfGUIDWithIPCommand(): set vf guid", "address", vfAddr, "guid", guidStr) | ||
|
|
||
| // Set both node_guid and port_guid using ip command | ||
| for _, guidType := range ibGUIDTypes { | ||
| cmd := exec.Command("ip", "link", "set", "dev", pfName, "vf", strconv.Itoa(vfID), guidType, guidStr) | ||
| if output, err := cmd.CombinedOutput(); err != nil { | ||
| return fmt.Errorf("failed to set %s: %w, output: %s", guidType, err, string(output)) | ||
| } | ||
| } | ||
|
|
||
| return nil | ||
| } |
There was a problem hiding this comment.
This new function, along with others in this PR, introduces a critical code path using exec.Command to work around the netlink issue. While the implementation looks correct and safe against command injection, it's crucial to have unit tests covering this new logic to ensure its correctness and robustness.
Please add tests for configureVfGUIDWithIPCommand and the other new code paths (like the one in configSriovDevice for setting the link up). You could use mocks for exec.Command and os.ReadFile to verify that:
- The correct commands and arguments are being constructed.
- The logic correctly handles both success and failure cases from external commands and file system operations.
| typePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, ifaceName, "type") | ||
| data, err := os.ReadFile(typePath) |
There was a problem hiding this comment.
The iface.Name field, which originates from the SriovNetworkNodeState Custom Resource, is used to construct file paths in getLinkTypeFromSysfs without proper validation or sanitization. An attacker with permissions to modify the CR can provide a malicious name containing path traversal sequences (e.g., ../../etc/shadow/..) to attempt to read arbitrary files on the host system. While the impact is partially mitigated by the hardcoded suffix (/type), the content of the read file is exposed in logs if parsing fails (line 248), leading to potential information disclosure.
pkg/host/internal/sriov/sriov.go
Outdated
| flagsPath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, iface.Name, "flags") | ||
| data, err := os.ReadFile(flagsPath) |
There was a problem hiding this comment.
Similar to the issue in getLinkTypeFromSysfs, the iface.Name field is used here to construct a path to the interface flags file. An attacker could use path traversal to read files named flags in other directories. The content of the file is explicitly included in the error message on line 754 if parsing fails, which can be exposed to users via the Custom Resource status or system logs.
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that causes issues with InfiniBand devices having a large number of Virtual Functions. The changes bypass netlink queries, which can fail due to message size limits, and instead use ip commands and direct sysfs reads for device configuration. The implementation is well-isolated and robust, with good error handling. I've identified one area with some redundant code that could be simplified.
pkg/host/internal/sriov/sriov.go
Outdated
| if pfLink == nil { | ||
| if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil { | ||
| return err | ||
| } | ||
| } else { | ||
| if err := s.infinibandHelper.ConfigureVfGUID(addr, iface.PciAddress, vfID, pfLink); err != nil { | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
The else block starting at line 636 appears to be unreachable. The pfLink variable is intentionally set to nil for InfiniBand devices at line 585 to bypass netlink queries. The check at line 629 ensures this block is only entered for InfiniBand devices. Therefore, pfLink == nil will always be true, and the else branch will never be executed. This can be simplified by removing the conditional and the unreachable else block.
if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
return err
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request provides a solid workaround for the kernel netlink PAGE_SIZE limitation affecting InfiniBand devices with a large number of VFs. The approach of bypassing netlink calls for IB devices and using ip commands and sysfs reads is well-implemented and cleanly isolated. The refactoring of GUID generation into GetVfGUID and its use of crypto/rand are good improvements. The changes are well-commented and the logic is easy to follow. I have one suggestion to simplify a piece of code by removing an unreachable else block.
pkg/host/internal/sriov/sriov.go
Outdated
| if pfLink == nil { | ||
| if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil { | ||
| return err | ||
| } | ||
| } else { | ||
| if err := s.infinibandHelper.ConfigureVfGUID(addr, iface.PciAddress, vfID, pfLink); err != nil { | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
The else block starting on line 636 appears to be unreachable. For InfiniBand devices, pfLink is intentionally set to nil before this loop to avoid problematic netlink calls. Consequently, the pfLink == nil check on line 632 will always be true for IB devices, making the else branch dead code. You can simplify this by removing the conditional and directly calling configureVfGUIDWithIPCommand.
if err := s.configureVfGUIDWithIPCommand(addr, iface.PciAddress, iface.Name, vfID); err != nil {
return err
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that limits netlink message sizes, causing failures when configuring InfiniBand devices with many VFs. The approach of bypassing netlink for IB devices and using ip commands and sysfs directly is sound. The changes are well-isolated to InfiniBand device handling.
I've found one critical issue where some helper functions that still use netlink are called for InfiniBand devices, which would defeat the purpose of this workaround. I've left a specific comment with a suggested fix for this.
Otherwise, the code is well-structured and the new functions are clear. The refactoring in infiniband.go to extract GetVfGUID is a good improvement.
pkg/host/internal/sriov/sriov.go
Outdated
| iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address) | ||
| iface.Mac = s.networkHelper.GetNetDevMac(pfNetName) |
There was a problem hiding this comment.
The functions s.networkHelper.GetNetdevMTU() and s.networkHelper.GetNetDevMac() both rely on netlink.LinkByName() internally. This will re-introduce the 'message too long' error for InfiniBand devices with many VFs, as querying the PF link via netlink is what this PR aims to avoid.
To fix this, the MTU and MAC address should be read directly from sysfs for InfiniBand devices, similar to how getLinkTypeFromSysfs is implemented.
| iface.Mtu = s.networkHelper.GetNetdevMTU(device.Address) | |
| iface.Mac = s.networkHelper.GetNetDevMac(pfNetName) | |
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "mtu")); err == nil { | |
| iface.Mtu, _ = strconv.Atoi(strings.TrimSpace(string(data))) | |
| } | |
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | |
| iface.Mac = strings.TrimSpace(string(data)) | |
| } |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that limits netlink message sizes, affecting InfiniBand devices with many VFs. The changes cleverly bypass netlink for IB devices by using direct ip commands and sysfs reads for device discovery and configuration. This is a well-isolated and necessary solution. The code is well-structured, and the changes for IB devices are clearly separated from the existing logic for other devices.
My review includes a few minor suggestions to improve logging consistency for better debuggability and to clean up a duplicated code comment. Overall, this is a solid contribution that addresses a critical issue.
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | ||
| iface.Mac = strings.TrimSpace(string(data)) | ||
| } |
There was a problem hiding this comment.
For consistency with how MTU and admin state failures are handled in this function, consider adding a log message when reading the MAC address from sysfs fails. This will improve debuggability.
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | |
| iface.Mac = strings.TrimSpace(string(data)) | |
| } | |
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | |
| iface.Mac = strings.TrimSpace(string(data)) | |
| } else { | |
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err) | |
| } |
pkg/host/internal/sriov/sriov.go
Outdated
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| // set PF mtu |
pkg/host/internal/sriov/sriov.go
Outdated
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } |
There was a problem hiding this comment.
When getMTUFromSysfs returns an error, it is silently ignored. This could make debugging difficult. Consider adding a log message in an else block to record the error, which would be consistent with other parts of the code.
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
} else {
log.Log.V(2).Info("configSriovPFDevice(): failed to get MTU from sysfs for IB device", "name", ifName, "error", err)
}
pkg/host/internal/sriov/sriov.go
Outdated
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } |
There was a problem hiding this comment.
When getMTUFromSysfs returns an error, it is silently ignored. This could make debugging difficult. Consider adding a log message in the else block to record the error, similar to how it's done in DiscoverSriovDevices.
if mtu, err := s.getMTUFromSysfs(ifName); err == nil {
currentMtu = mtu
} else {
log.Log.V(2).Info("checkExternallyManagedPF(): failed to get MTU from sysfs for IB device", "name", ifName, "error", err)
}f4d52c9 to
0e4cd66
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that causes 'message too long' errors when configuring InfiniBand devices with a large number of virtual functions. The solution is to bypass netlink library calls for InfiniBand devices and instead use ip commands for configuration and read from sysfs for device state.
The changes are well-isolated to handle InfiniBand devices differently, while retaining the existing logic for other device types. New helper functions for interacting with sysfs and executing ip commands have been added, which improves modularity. The refactoring in the infiniband package to separate GUID generation is also a good improvement.
Overall, this is a solid implementation of the described workaround. I have a couple of minor suggestions for code cleanup.
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| iface.LinkType = linkType | ||
| iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) |
InfiniBand devices with 16+ VFs exceed the kernel's PAGE_SIZE limit
(~4KB) when querying via netlink. This causes 'message too long' errors
during discovery and configuration.
Root Cause:
- Kernel netlink responses are capped at PAGE_SIZE regardless of socket
buffer configuration
- IB devices return extensive metadata (MAC, VLAN, Node GUID, Port GUID,
RDMA stats) that exceeds this limit with many VFs
Solution:
- Bypass netlink queries for IB devices by using sysfs and ip commands
- Detect IB devices via /sys/class/net/<iface>/type (ARPHRD_INFINIBAND=32)
- Read attributes directly from sysfs (MTU, MAC, link flags, link type)
- Set attributes via 'ip link set' commands to avoid netlink
Changes:
1. Added helper functions for sysfs access:
- getLinkTypeFromSysfs(): Detect IB vs Ethernet
- getMTUFromSysfs(): Read MTU from sysfs
- setMTUViaIPCommand(): Set MTU via ip command
- getLinkAdminStateFromSysfs(): Read IFF_UP flag from sysfs
- configureVfGUIDWithIPCommand(): Set VF GUIDs via ip command
2. Fixed DiscoverSriovDevices():
- Check link type before using netlink
- Use sysfs for IB device discovery
3. Fixed DiscoverSriovVirtualDevices():
- Use sysfs for IB VF discovery
4. Fixed configSriovDevice():
- Set PF link up via ip command for IB devices
5. Fixed configSriovVFDevices():
- Skip netlink LinkByName for IB PFs
- Use ip commands for VF GUID configuration
6. Fixed ResetSriovDevice():
- Use ip command for IB MTU reset
7. Fixed configSriovPFDevice():
- Check link type before MTU operations
- Use sysfs/ip commands for IB devices
8. Fixed checkExternallyManagedPF():
- Use sysfs for IB MTU checks
9. Added constants:
- ARPHRD_ETHER = 1
- ARPHRD_INFINIBAND = 32
10. Updated InfinibandInterface:
- Added GetVfGUID() method for GUID retrieval
- Use crypto/rand for secure random GUID generation
Fixes: OCPBUGS-74889
Fixes: RHEL-145522
0e4cd66 to
3e6eaa9
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression that limits netlink message size, affecting InfiniBand devices with many VFs. The changes bypass netlink queries for IB devices by using direct ip commands and sysfs reads for device discovery and configuration. The overall approach is sound and well-isolated to InfiniBand device handling. The code is clear and the changes are consistent across the affected functions. I've added a few suggestions to improve maintainability by reducing some code duplication. These are minor refactoring opportunities and don't affect the correctness of the implementation.
| if isIB { | ||
| // For InfiniBand devices with many VFs, avoid netlink query | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink", | ||
| "device", device.Address, "name", pfNetName) | ||
| // Read MTU and MAC directly from sysfs to avoid netlink | ||
| if mtu, err := s.getMTUFromSysfs(pfNetName); err == nil { | ||
| iface.Mtu = mtu | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MTU from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | ||
| iface.Mac = strings.TrimSpace(string(data)) | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| iface.LinkType = linkType | ||
| iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) | ||
| if adminState, err := s.getLinkAdminStateFromSysfs(pfNetName); err == nil { | ||
| iface.LinkAdminState = adminState | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get link admin state from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| } else { | ||
| // For non-IB devices, use netlink as before | ||
| link, err := s.netlinkLib.LinkByName(pfNetName) | ||
| if err != nil { | ||
| log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address) | ||
| continue | ||
| } | ||
|
|
||
| iface.Mtu = link.Attrs().MTU | ||
| iface.Mac = link.Attrs().HardwareAddr.String() | ||
| iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType) | ||
| iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) | ||
| iface.LinkAdminState = s.networkHelper.GetNetDevLinkAdminState(pfNetName) | ||
| } |
There was a problem hiding this comment.
pkg/host/internal/sriov/sriov.go
Outdated
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| // set PF mtu |
| if iface.Mtu > 0 { | ||
| linkType := s.getLinkTypeWithIBFallback(iface) | ||
| var currentMtu int | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | ||
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if ifName == "" { | ||
| log.Log.Error(nil, "configSriovPFDevice(): failed to get interface name", "device", iface.PciAddress) | ||
| return fmt.Errorf("failed to get interface name for %s", iface.PciAddress) | ||
| } | ||
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } | ||
| if iface.Mtu > currentMtu { | ||
| if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } | ||
| } else { | ||
| // Use netlink for non-IB devices | ||
| currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress) | ||
| if iface.Mtu > currentMtu { | ||
| err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu) | ||
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The logic for setting the MTU, including the check iface.Mtu > currentMtu and the error handling, is duplicated between the InfiniBand and non-InfiniBand cases. This can be refactored to reduce duplication and improve maintainability. You could determine currentMtu and the appropriate setMtu function within the if/else block, and then call the function and handle the error once after the block.
pkg/host/internal/sriov/sriov.go
Outdated
| linkType := s.getLinkTypeWithIBFallback(iface) | ||
| var currentMtu int | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | ||
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if ifName != "" { | ||
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } | ||
| } | ||
| } else { | ||
| currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress) | ||
| } |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug that limits netlink message sizes, affecting InfiniBand devices with many VFs. The solution bypasses netlink by using ip commands and reading from sysfs directly, which is a sound approach. The changes are well-contained, and the use of crypto/rand for GUID generation is a good security improvement. My review includes suggestions to improve maintainability by reducing code duplication and to fix a potential bug in the validation logic for externally managed devices to ensure more robust error handling.
pkg/host/internal/sriov/sriov.go
Outdated
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | ||
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if ifName != "" { | ||
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } | ||
| } |
There was a problem hiding this comment.
In checkExternallyManagedPF, if TryGetInterfaceName returns an empty string for an InfiniBand device, the function proceeds silently, and currentMtu remains 0. This can lead to incorrect validation, as the subsequent MTU check (iface.Mtu > currentMtu) might pass when it should have been validated against the actual device MTU. This could allow an invalid configuration for an externally managed device. It would be safer to return an error if the interface name or MTU cannot be determined, ensuring the check is robust.
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | |
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | |
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | |
| if ifName != "" { | |
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | |
| currentMtu = mtu | |
| } | |
| } | |
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | |
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | |
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | |
| if ifName == "" { | |
| err := fmt.Errorf("checkExternallyManagedPF(): failed to get interface name for IB device %s", iface.PciAddress) | |
| log.Log.Error(err, "") | |
| return err | |
| } | |
| mtu, err := s.getMTUFromSysfs(ifName) | |
| if err != nil { | |
| log.Log.Error(err, "checkExternallyManagedPF(): failed to get MTU from sysfs for IB device", "name", ifName) | |
| return err | |
| } | |
| currentMtu = mtu |
pkg/host/internal/sriov/sriov.go
Outdated
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| // set PF mtu |
pkg/host/internal/sriov/sriov.go
Outdated
| linkType := s.getLinkTypeWithIBFallback(iface) | ||
| var currentMtu int | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| // Use sysfs for IB to avoid netlink PAGE_SIZE limit | ||
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if ifName == "" { | ||
| log.Log.Error(nil, "configSriovPFDevice(): failed to get interface name", "device", iface.PciAddress) | ||
| return fmt.Errorf("failed to get interface name for %s", iface.PciAddress) | ||
| } | ||
| if mtu, err := s.getMTUFromSysfs(ifName); err == nil { | ||
| currentMtu = mtu | ||
| } | ||
| if iface.Mtu > currentMtu { | ||
| if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } | ||
| } else { | ||
| // Use netlink for non-IB devices | ||
| currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress) | ||
| if iface.Mtu > currentMtu { | ||
| err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu) | ||
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The logic to determine the current MTU for an interface, which handles both InfiniBand and other link types, is duplicated between this function (configSriovPFDevice) and checkExternallyManagedPF (lines 645-658). To improve maintainability and avoid potential inconsistencies in the future, consider extracting this logic into a new helper method on the sriov struct, for example: getCurrentMtu(iface *sriovnetworkv1.Interface) (int, error).
- Create getCurrentMtu() helper function that properly handles IB devices - Use sysfs for IB devices, netlink for others - Replace duplicated MTU logic in configSriovPFDevice() and checkExternallyManagedPF() - Ensure proper error handling for IB devices (return error instead of silently using 0) - Remove duplicate comment "// set PF mtu" Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel netlink issue affecting InfiniBand devices with numerous VFs, which can cause "message too long" errors. The solution cleverly bypasses netlink for these devices by leveraging direct ip command execution for configuration and reading device state from sysfs. The changes are well-isolated to InfiniBand device handling, ensuring that other devices continue to use the existing netlink-based logic. The implementation is clean and adds several helper functions for sysfs interaction, improving modularity. I've included a few suggestions to enhance code clarity by removing redundant function calls and to address a potential bug related to a missing check.
pkg/host/internal/sriov/sriov.go
Outdated
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil { |
There was a problem hiding this comment.
There's a potential issue here. s.networkHelper.TryGetInterfaceName can return an empty string, but there is no check for it before calling setMTUViaIPCommand. This will cause the ip command to fail. Please add a check for an empty interface name.
Also, iface.Name should already contain the interface name, making the call to TryGetInterfaceName redundant.
pkg/host/internal/sriov/sriov.go
Outdated
| ifName := s.networkHelper.TryGetInterfaceName(ifaceStatus.PciAddress) | ||
| if ifName == "" { | ||
| return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress) | ||
| } |
There was a problem hiding this comment.
The interface name is already available in ifaceStatus.Name. Calling TryGetInterfaceName again is redundant. You can use ifaceStatus.Name directly.
| ifName := s.networkHelper.TryGetInterfaceName(ifaceStatus.PciAddress) | |
| if ifName == "" { | |
| return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress) | |
| } | |
| ifName := ifaceStatus.Name | |
| if ifName == "" { | |
| return fmt.Errorf("failed to get interface name for %s", ifaceStatus.PciAddress) | |
| } |
pkg/host/internal/sriov/sriov.go
Outdated
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | ||
| if ifName == "" { | ||
| return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress) | ||
| } |
There was a problem hiding this comment.
The interface name should already be available in iface.Name. Calling TryGetInterfaceName again is redundant. You can use iface.Name directly.
| ifName := s.networkHelper.TryGetInterfaceName(iface.PciAddress) | |
| if ifName == "" { | |
| return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress) | |
| } | |
| ifName := iface.Name | |
| if ifName == "" { | |
| return 0, fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress) | |
| } |
- Replace TryGetInterfaceName() calls with direct .Name field access in ResetSriovDevice, getCurrentMtu, and configSriovPFDevice - Add empty name check in configSriovPFDevice before calling setMTUViaIPCommand to prevent ip command failures - Both Interface and InterfaceExt structs already have Name field populated, making TryGetInterfaceName redundant in these contexts Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression affecting InfiniBand devices with many VFs, which causes netlink messages to exceed PAGE_SIZE. The solution bypasses netlink for IB devices by using direct ip commands for configuration and reading device information from sysfs. A security audit confirmed no vulnerabilities meeting the required severity threshold were identified, and the implementation follows security best practices, including safe handling of os/exec to prevent command injection. My main feedback is a minor refactoring suggestion to reduce code duplication for better maintainability.
| if isIB { | ||
| // For InfiniBand devices with many VFs, avoid netlink query | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): detected InfiniBand device, using sysfs instead of netlink", | ||
| "device", device.Address, "name", pfNetName) | ||
| // Read MTU and MAC directly from sysfs to avoid netlink | ||
| if mtu, err := s.getMTUFromSysfs(pfNetName); err == nil { | ||
| iface.Mtu = mtu | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MTU from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| if data, err := os.ReadFile(filepath.Join(vars.FilesystemRoot, consts.SysClassNet, pfNetName, "address")); err == nil { | ||
| iface.Mac = strings.TrimSpace(string(data)) | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get MAC from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| iface.LinkType = linkType | ||
| iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) | ||
| if adminState, err := s.getLinkAdminStateFromSysfs(pfNetName); err == nil { | ||
| iface.LinkAdminState = adminState | ||
| } else { | ||
| log.Log.V(2).Info("DiscoverSriovDevices(): failed to get link admin state from sysfs for IB device", "name", pfNetName, "error", err) | ||
| } | ||
| } else { | ||
| // For non-IB devices, use netlink as before | ||
| link, err := s.netlinkLib.LinkByName(pfNetName) | ||
| if err != nil { | ||
| log.Log.Error(err, "DiscoverSriovDevices(): unable to get Link for device, skipping", "device", device.Address) | ||
| continue | ||
| } | ||
|
|
||
| iface.Mtu = link.Attrs().MTU | ||
| iface.Mac = link.Attrs().HardwareAddr.String() | ||
| iface.LinkType = s.encapTypeToLinkType(link.Attrs().EncapType) | ||
| iface.LinkSpeed = s.networkHelper.GetNetDevLinkSpeed(pfNetName) | ||
| iface.LinkAdminState = s.networkHelper.GetNetDevLinkAdminState(pfNetName) | ||
| } |
…Fallback call - Call getLinkTypeWithIBFallback once and reuse the linkType variable - Inline MTU retrieval logic to avoid redundant function call - Reduces overhead by avoiding duplicate link type detection Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel regression affecting InfiniBand devices with many VFs, which causes netlink queries to fail. The solution correctly bypasses netlink for these devices by using ip commands and reading from sysfs directly. The changes are well-structured, introducing new helper functions to encapsulate the new logic. Additionally, the GUID generation has been improved to use crypto/rand for better security. I have one suggestion to refactor a section of the code to improve maintainability by reducing duplication.
| if iface.Mtu > 0 { | ||
| linkType := s.getLinkTypeWithIBFallback(iface) | ||
| var currentMtu int | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| ifName := iface.Name | ||
| if ifName == "" { | ||
| err := fmt.Errorf("failed to get interface name for IB device %s", iface.PciAddress) | ||
| log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU") | ||
| return err | ||
| } | ||
| mtu, err := s.getMTUFromSysfs(ifName) | ||
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU from sysfs", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| currentMtu = mtu | ||
| } else { | ||
| currentMtu = s.networkHelper.GetNetdevMTU(iface.PciAddress) | ||
| } | ||
|
|
||
| if iface.Mtu > currentMtu { | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| ifName := iface.Name | ||
| if ifName == "" { | ||
| err := fmt.Errorf("interface name is empty for device %s", iface.PciAddress) | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to get interface name") | ||
| return err | ||
| } | ||
| if err := s.setMTUViaIPCommand(ifName, iface.Mtu); err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } else { | ||
| err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu) | ||
| if err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This block of code duplicates the logic for getting the current MTU, which is already encapsulated in the getCurrentMtu function introduced in this same pull request. Using the existing helper function will reduce code duplication and improve maintainability.
if iface.Mtu > 0 {
currentMtu, err := s.getCurrentMtu(iface)
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): failed to get current MTU", "device", iface.PciAddress)
return err
}
if iface.Mtu > currentMtu {
linkType := s.getLinkTypeWithIBFallback(iface)
if strings.EqualFold(linkType, consts.LinkTypeIB) {
if err := s.setMTUViaIPCommand(iface.Name, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
} else {
if err := s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu); err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}
}
}
}- Replace duplicated MTU retrieval logic in configSriovPFDevice with getCurrentMtu call - Improves maintainability by centralizing MTU logic in one place - Simplifies error handling by reusing existing helper function Signed-off-by: Beni Ben Shabat <bbenshab@redhat.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for a kernel bug affecting InfiniBand devices by using ip commands instead of netlink calls to avoid PAGE_SIZE limitations. The changes are well-isolated to IB devices, which are now detected by reading from sysfs. New helper functions are introduced to interact with sysfs and execute ip commands. The GUID generation for IB VFs is also refactored. The changes are logical and well-implemented. I have one suggestion for a minor refactoring to reduce code duplication.
| linkType := s.getLinkTypeWithIBFallback(iface) | ||
| if strings.EqualFold(linkType, consts.LinkTypeIB) { | ||
| if err := s.setMTUViaIPCommand(iface.Name, iface.Mtu); err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } else { | ||
| if err := s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu); err != nil { | ||
| log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress) | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
The error handling logic for setting the MTU is duplicated for the InfiniBand and non-InfiniBand cases. This can be refactored to reduce code duplication and improve readability.
linkType := s.getLinkTypeWithIBFallback(iface)
var err error
if strings.EqualFold(linkType, consts.LinkTypeIB) {
err = s.setMTUViaIPCommand(iface.Name, iface.Mtu)
} else {
err = s.networkHelper.SetNetdevMTU(iface.PciAddress, iface.Mtu)
}
if err != nil {
log.Log.Error(err, "configSriovPFDevice(): fail to set mtu for PF", "device", iface.PciAddress)
return err
}InfiniBand devices with 16+ VFs exceed the kernel's PAGE_SIZE limit (~4KB) when querying via netlink. This causes 'message too long' errors when the CNI plugin tries to configure VFs. Root Cause: - Kernel netlink responses are capped at PAGE_SIZE regardless of socket buffer configuration - IB devices return extensive metadata (MAC, VLAN, Node GUID, Port GUID, RDMA stats) that exceeds this limit with many VFs - The SR-IOV CNI plugin queries the PF link to apply/fill/reset VF config Solution: - Detect InfiniBand devices via sysfs /sys/class/net/<iface>/type - Skip netlink queries for IB PFs when configuring VFs - Read MTU from sysfs instead of netlink for IB devices - Skip VF configuration operations that don't apply to IB (VLAN, spoofchk) Changes: 1. pkg/utils/utils.go: - Added ARPHRD_ETHER and ARPHRD_INFINIBAND constants - Added IsInfiniBandDevice() function to detect IB via sysfs 2. pkg/sriov/sriov.go: - Added os, path/filepath, strconv, strings imports - Modified ApplyVFConfig() to skip netlink for IB devices - Modified FillOriginalVfInfo() to skip netlink for IB devices - Modified ResetVFConfig() to skip netlink for IB devices - Read MTU from sysfs for IB devices instead of netlink This fix complements the sriov-network-operator fix (PR #1026) which addressed the same issue in the config daemon. Fixes: k8snetworkplumbingwg/sriov-network-operator#1026
Set VF policy to 'Follow' for InfiniBand devices so VFs inherit PF link state. This is required for IB VFs to come up when the PF is up. The policy is set via sysfs (/sys/class/net/<pf>/device/sriov/<vfid>/policy) to avoid netlink PAGE_SIZE limit issues with IB devices that have many VFs.
|
Closed by mistake - reopening. |
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
|
Hi @bbenshab let me know if we can close this one, as this is kernel regression I don't think we need to implement w/a in the operator just let the linux kernel to fix the regression. |
Problem
When configuring InfiniBand devices with 16 VFs, the SR-IOV Network Operator fails with "message too long" errors:
Root Cause
This is caused by a kernel regression (see OCPBUGS-74889 and RHEL-145522) where the Linux kernel limits individual netlink messages to PAGE_SIZE (~4KB on x86_64).
InfiniBand devices with 16 VFs return metadata (MAC, VLAN, Node GUID, Port GUID, RDMA stats per VF) that exceeds this 4KB limit. The kernel enforces this limit at the message level, making it impossible to query link information via netlink for IB devices with many VFs.
Note: While this is a kernel regression that should ultimately be fixed in the kernel, we cannot wait for a kernel fix as we need to support InfiniBand devices now. This workaround provides a working solution that is isolated to IB device handling.
Solution
Skip netlink queries entirely for InfiniBand devices and use direct
ip link setcommands to configure VF GUIDs:LinkByName()for IB devices, setpfLink=nilconfigureVfGUIDWithIPCommand()whenpfLink=nilconfigureVfGUIDWithIPCommand()generates random GUIDs and usesip link setcommands directlyip link setfor IB devices, netlink for non-IB devicesWhy this works:
ip link setcommands use netlink SETLINK operations that don't query VF dataChanges
Testing
Environment:
References
Note on Kernel Regression
While this issue originates from a kernel regression, we cannot wait for a kernel fix because:
This PR provides a working solution that unblocks InfiniBand deployments while the kernel team addresses the underlying regression.