feat: Add RDMA device support with CDI integration#38
feat: Add RDMA device support with CDI integration#38rollandf wants to merge 1 commit intok8snetworkplumbingwg:mainfrom
Conversation
|
depends on #32 |
Pull Request Test Coverage Report for Build 22016933456Details
💛 - Coveralls |
5d916f2 to
8444628
Compare
WalkthroughAdds RDMA detection and handling: discovery marks per‑VF RDMA capability and NUMA; host gains an RdmaProvider abstraction and RDMA APIs; state injects RDMA char devices and RDMA-related env vars into CDI specs; tests and mocks updated to cover RDMA flows. Changes
Sequence Diagram(s)sequenceDiagram
participant Discovery
participant Host
participant RdmaProvider
participant rdmamap
participant State
participant CDI
Discovery->>Host: VerifyRDMACapability(pciAddr)
Host->>RdmaProvider: GetRdmaDevicesForPcidev(pciAddr)
RdmaProvider->>rdmamap: GetRdmaDevicesForPcidev(pciAddr)
rdmamap-->>RdmaProvider: [rdmaDevices]
RdmaProvider-->>Host: [rdmaDevices]
Host-->>Discovery: bool (has RDMA devices)
Discovery->>Discovery: set RDMACapable, NumaNode in VF attributes
State->>Host: GetRDMADeviceForPCI(pciAddr)
Host->>RdmaProvider: GetRdmaDevicesForPcidev(pciAddr)
RdmaProvider->>rdmamap: GetRdmaDevicesForPcidev(pciAddr)
rdmamap-->>RdmaProvider: [rdmaDevices]
RdmaProvider-->>Host: [rdmaDevices]
Host-->>State: [rdmaDevices]
loop for each rdmaDevice
State->>Host: GetRDMACharDevices(rdmaDeviceName)
Host->>RdmaProvider: GetRdmaCharDevices(rdmaDeviceName)
RdmaProvider->>rdmamap: GetRdmaCharDevices(rdmaDeviceName)
rdmamap-->>RdmaProvider: [charDevices]
RdmaProvider-->>Host: [charDevices]
Host-->>State: [charDevices]
State->>CDI: Append CDI char device node(s)
State->>State: Add env vars (uverbs, umad, issm, rdma_cm, device name)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (10)
pkg/devicestate/state_test.go (3)
90-118: Tests only verify mock behavior, not production code.This test sets up mock expectations and then calls the mock methods directly, verifying that mocks return expected values. It doesn't actually test the production
applyConfigOnDevicefunction that would use these mocks to prepare RDMA devices.Consider creating an integration test that:
- Sets up a
Managerinstance with the mock host- Calls the actual
applyConfigOnDeviceorPrepareDevicesForClaimmethod- Verifies the resulting CDI spec contains the expected device nodes
🔎 Example integration test approach
It("should integrate RDMA device preparation into CDI spec", func() { // Setup Manager with mocks manager := &Manager{ allocatable: map[string]resourceapi.Device{ "0000-08-00-5": *deviceInfo, }, // ... other required fields } // Mock the host.GetHelpers() to return mockHost // Then call actual production code and verify CDI spec output })
120-135: Environment variable format test is documentation-only.This test documents the expected environment variable naming convention but doesn't verify that the production code actually generates these variables. The assertions only check the test's own
expectedEnvVarsmap.
228-243: Character device type identification test is basic.This test only verifies that path strings contain expected substrings. The actual device type parsing logic in production code (the
switchstatement instate.go) is not exercised here.pkg/devicestate/state.go (3)
217-223: Silent failure on GetRDMADeviceForPCI error may mask issues.When
GetRDMADeviceForPCIfails, the error is logged but the device is still prepared without RDMA support. While this allows graceful degradation, users may not notice that RDMA was silently disabled.Consider adding a warning-level log or an event to make this more visible, especially since the device was discovered as RDMA-capable during initial discovery.
243-248: Potential duplicate CDI device nodes for shared devices like rdma_cm.
/dev/infiniband/rdma_cmis typically a system-wide device, not per-RDMA-adapter. IfGetRDMACharDevicesreturns it for multiple RDMA devices, it will be added multiple times todeviceNodes.Consider deduplicating device nodes before adding them to the CDI spec:
🔎 Suggested deduplication approach
// Track added device paths to avoid duplicates addedDevices := make(map[string]bool) for _, charDev := range charDevices { if addedDevices[charDev] { continue // Skip duplicate } addedDevices[charDev] = true deviceNodes = append(deviceNodes, &cdispec.DeviceNode{ Path: charDev, HostPath: charDev, Type: "c", }) // ... rest of the logic }
252-262: Consider extracting devicePrefix outside the inner loop.The
devicePrefixis recomputed for every character device but only depends onresult.Device, which doesn't change within the loops. Moving it outside improves readability and avoids redundant computation.+ devicePrefix := strings.ReplaceAll(result.Device, "-", "_") for _, rdmaDevice := range rdmaDevices { // ... for _, charDev := range charDevices { // ... - devicePrefix := strings.ReplaceAll(result.Device, "-", "_") switch {pkg/devicestate/discovery.go (1)
145-151: Consider logging at a lower verbosity level for non-critical errors.The RDMA capability check correctly falls back to
falseon error. However, usinglogger.Errorfor this may be too noisy in environments without RDMA hardware. Consider usinglogger.V(1).Infoorlogger.V(2).Infoinstead, since this is an expected condition for non-RDMA devices.🔎 Suggested change
// Check RDMA capability for this VF rdmaCapable, err := host.GetHelpers().VerifyRDMACapability(vfInfo.PciAddress) if err != nil { - logger.Error(err, "Failed to verify RDMA capability, assuming not RDMA capable", + logger.V(2).Info("Failed to verify RDMA capability, assuming not RDMA capable", + "error", err, "vfAddress", vfInfo.PciAddress) rdmaCapable = false }pkg/host/host_test.go (1)
582-587: Consider adding type assertion safety check.The type assertion
host.NewHost().(*host.Host)will panic ifNewHost()ever returns a different implementation. While this is unlikely in the current codebase, adding a safety check improves test robustness.🔎 Suggested change
BeforeEach(func() { mockCtrl = gomock.NewController(GinkgoT()) mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl) - hostImpl = host.NewHost().(*host.Host) + var ok bool + hostImpl, ok = host.NewHost().(*host.Host) + Expect(ok).To(BeTrue(), "NewHost() should return *host.Host") hostImpl.SetRdmaProvider(mockRdmaProvider) })pkg/host/host.go (2)
153-157: Consider adding nil guard for defensive coding.While this method is primarily for testing, a nil check would prevent potential nil pointer dereferences if called incorrectly.
🔎 Proposed defensive nil check
func (h *Host) SetRdmaProvider(provider RdmaProvider) { + if provider == nil { + return + } h.rdmaProvider = provider }
786-800: Error return value is never non-nil.The method signature includes an error return, but the implementation never returns an error since
rdmaProvider.GetRdmaDevicesForPcidevonly returns[]string. This may be intentional for future extensibility, but consider documenting this in the method's godoc or updating the interface if errors should be propagated.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
go.mod(2 hunks)pkg/consts/consts.go(2 hunks)pkg/devicestate/discovery.go(1 hunks)pkg/devicestate/discovery_test.go(7 hunks)pkg/devicestate/state.go(1 hunks)pkg/devicestate/state_test.go(2 hunks)pkg/host/host.go(3 hunks)pkg/host/host_test.go(2 hunks)pkg/host/mock/mock_host.go(2 hunks)pkg/host/mock/mock_rdma_provider.go(1 hunks)pkg/host/rdma_provider.go(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/rdma_provider.gopkg/host/mock/mock_rdma_provider.gopkg/host/host_test.gopkg/host/mock/mock_host.gopkg/host/host.go
📚 Learning: 2025-12-10T11:30:12.071Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:12.071Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type returns numeric format like "1: CA" not plain text "CA". The numeric prefixes are: 1 = CA (Channel Adapter - InfiniBand or RoCE), 4 = RNIC (iWARP). Code should use strings.HasPrefix() to parse these values.
Applied to files:
pkg/consts/consts.go
🧬 Code graph analysis (5)
pkg/devicestate/discovery.go (2)
pkg/host/host.go (1)
GetHelpers(148-151)pkg/consts/consts.go (12)
AttributeVendorID(35-35)AttributeDeviceID(36-36)AttributePFDeviceID(37-37)AttributePciAddress(32-32)AttributePFName(33-33)AttributeEswitchMode(34-34)AttributeVFID(38-38)AttributeNumaNode(42-42)AttributePCIeRoot(60-60)AttributeParentPciAddress(47-47)AttributeStandardPciAddress(44-44)AttributeRDMACapable(40-40)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
MockInterface(23-27)MockInterfaceMockRecorder(30-32)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
AttributeRDMACapable(40-40)pkg/host/host.go (1)
GetHelpers(148-151)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint
- GitHub Check: Build Container
- GitHub Check: Unit Tests
🔇 Additional comments (23)
go.mod (2)
83-84: Transitive dependencies look appropriate.The
vishvananda/netlinkandvishvananda/netnsare standard Go libraries for Linux network namespace and netlink operations, commonly used alongside RDMA libraries.
8-8: LGTM - rdmamap v1.1.0 for RDMA device discovery and mapping.The
github.com/Mellanox/rdmamaplibrary provides a map of RDMA devices with their character and network devices, making it suitable for RDMA device support in this DRA driver context. rdmamap v1.1.0 is stable and actively used in production K8s plugins.pkg/consts/consts.go (2)
40-40: LGTM - RDMA capability attribute follows existing conventions.The
AttributeRDMACapableconstant follows the established naming pattern and will be used to mark RDMA-capable VFs during discovery.
52-54: LGTM - Correct sysfs path for InfiniBand devices.
/sys/class/infinibandis the standard Linux sysfs path for discovering RDMA/InfiniBand devices and their properties.pkg/devicestate/state_test.go (2)
51-88: Test setup looks correct, but consider testing actual integration.The mock setup and lifecycle management are properly implemented. However, the
deviceInfovariable is created but not used to invoke any production code that would exercise the RDMA preparation logic.
170-193: Good coverage for non-RDMA device skip behavior.This test correctly verifies that
GetRDMACharDevicesis never called when a device is not RDMA-capable, which validates the guard condition in production code.pkg/devicestate/discovery_test.go (3)
81-82: LGTM - Existing tests updated for RDMA capability verification.The existing test cases are properly updated to include
VerifyRDMACapabilitymock expectations, ensuring the discovery flow remains testable after the RDMA changes.
255-321: Comprehensive test for RDMA-capable VF discovery.This test covers:
- Mixed RDMA capability (VF1 capable, VF2 not capable)
- Correct attribute propagation for RDMA-capable devices
- Verification of all standard and RDMA-specific attributes
The assertions at lines 314 and 320 properly verify the
AttributeRDMACapableboolean attribute for both cases.
323-357: Good error handling test for RDMA capability check failures.This test verifies that when
VerifyRDMACapabilityreturns an error, the discovery process:
- Does not fail entirely
- Defaults the device to non-RDMA capable (
false)This graceful degradation ensures device discovery continues even if RDMA capability detection fails for individual devices.
pkg/devicestate/state.go (1)
217-273: Overall RDMA integration logic is sound.The implementation correctly:
- Guards RDMA processing with capability check
- Handles errors gracefully without failing device preparation
- Adds appropriate CDI device nodes for RDMA character devices
- Sets environment variables following sriov-network-device-plugin conventions
The structure integrates well with the existing VFIO and vhost device handling patterns.
pkg/devicestate/discovery.go (2)
135-139: LGTM! Safe NUMA node parsing with reasonable default.The parsing logic correctly defaults to 0 when parsing fails, and the use of
ptr.Tois consistent with the rest of the codebase.
162-208: Well-structured attribute map with proper typing.The refactored attributes map is clean and readable. All attribute constants are correctly referenced from
constspackage, and the pointer types are consistently used viaptr.To. The newRDMACapableboolean attribute is properly integrated.pkg/host/host_test.go (3)
593-627: Good test coverage for GetRDMADeviceForPCI.The tests properly cover:
- Multiple RDMA devices returned
- Empty device list (returns nil)
- Non-RDMA-capable devices
Mock expectations are correctly configured with exact call counts.
629-662: Good test coverage for VerifyRDMACapability.Tests appropriately verify that capability returns
truewhen devices exist andfalsewhen the device list is empty.
664-744: Comprehensive test coverage for GetRDMACharDevices.The tests cover multiple scenarios including:
- Full set of character devices
- Subset of devices
- Empty device list
- Multiple RDMA devices with different character devices
This provides good confidence in the RDMA character device discovery functionality.
pkg/host/mock/mock_host.go (1)
1-10: Auto-generated mock file - no manual changes needed.This file is generated by
mockgenand follows standard gomock patterns. Ensure thego:generatedirective in the source file is run when interface methods are added or changed.pkg/host/rdma_provider.go (2)
23-30: Clean abstraction design for testability.The
RdmaProviderinterface provides a thin wrapper over the rdmamap library, enabling easy mocking in unit tests. Thego:generatedirective is correctly placed to auto-generate mocks.
32-47: Well-implemented provider with appropriate receiver type.The implementation correctly delegates to rdmamap's GetRdmaDevicesForPcidev and GetRdmaCharDevices functions, which both return []string. Using a value receiver on the empty
defaultRdmaProviderstruct is appropriate since there's no internal state.pkg/host/mock/mock_rdma_provider.go (1)
1-10: Auto-generated mock file - no manual changes needed.This file is correctly generated from
rdma_provider.goand follows standard gomock conventions. The mock is properly used inhost_test.gofor testing RDMA functionality.pkg/host/host.go (4)
113-118: LGTM!The RDMA interface methods are well-designed with appropriate return types and consistent naming following the existing interface patterns.
121-124: LGTM!Good use of dependency injection pattern by adding
rdmaProvideras a field, enabling testability through mock injection.
127-132: LGTM!Constructor properly initializes the
rdmaProviderwith the default implementation.
802-835: LGTM!
VerifyRDMACapabilityandGetRDMACharDevicesimplementations are clean and consistent with the file's patterns. Error handling inVerifyRDMACapabilityis properly structured to wrap any potential future errors fromGetRDMADeviceForPCI.
8444628 to
a007b52
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
pkg/devicestate/state_test.go (3)
90-118: Tests verify mock behavior rather than production code.This test sets up mock expectations and then directly calls the mocked methods (
mockHost.GetRDMADeviceForPCI,mockHost.GetRDMACharDevices). It doesn't actually test any production code instate.gothat would use these mocks.A proper integration test should:
- Create a
Managerinstance with the mocked host- Call a real
Managermethod that internally invokes the RDMA host functions- Verify the end result (e.g., CDI spec contains expected device nodes)
Currently this only confirms that gomock returns what you told it to return.
120-136: Test verifies hardcoded expectations without exercising production code.This test constructs an
expectedEnvVarsmap and asserts that it has 5 entries and contains specific keys. No production code that generates environment variables is invoked. The test is essentially validating a map literal against itself.Consider invoking the actual environment variable generation logic from
state.goand verifying the output matches expectations.
229-244: Test verifiesContainSubstringrather than production device-type logic.This test only checks that paths contain expected substrings using Gomega matchers. If there's production code that parses device paths to identify types (e.g., extracting "uverbs" from "/dev/infiniband/uverbs0"), that function should be tested directly. Otherwise, this test adds no value.
pkg/host/host.go (2)
786-800: Consider usingV(2)for the success log to reduce noise.Line 797 logs at
Infolevel for every successful RDMA device lookup. On systems with many RDMA-capable devices, this could be verbose. The "no devices found" case usesV(2)which is appropriate; the success case should likely match.🔎 Suggested fix
- h.log.Info("GetRDMADeviceForPCI(): found RDMA devices", + h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices", "pciAddress", pciAddr, "rdmaDevices", rdmaDevices)
818-835: Same logging level consideration asGetRDMADeviceForPCI.Line 832 also logs at
Infolevel for successful lookups. For consistency and to reduce log noise, consider usingV(2)here as well.🔎 Suggested fix
- h.log.Info("GetRDMACharDevices(): found character devices", + h.log.V(2).Info("GetRDMACharDevices(): found character devices", "rdmaDevice", rdmaDeviceName, "charDevices", charDevices)
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/devicestate/state.go(1 hunks)pkg/devicestate/state_test.go(2 hunks)pkg/host/host.go(3 hunks)pkg/host/host_test.go(2 hunks)pkg/host/mock/mock_host.go(2 hunks)pkg/host/mock/mock_rdma_provider.go(1 hunks)pkg/host/rdma_provider.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/host/mock/mock_host.go
- pkg/devicestate/state.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/host_test.gopkg/host/host.gopkg/host/mock/mock_rdma_provider.gopkg/host/rdma_provider.go
🧬 Code graph analysis (1)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Lint
- GitHub Check: Build Container
🔇 Additional comments (11)
pkg/devicestate/state_test.go (1)
138-169: Same pattern: tests mock returns rather than production integration.Similar to the previous tests, this sets up expectations and directly calls the mock. Consider consolidating these into integration tests that exercise
Managermethods.pkg/host/host_test.go (4)
582-591: LGTM! Proper test setup with mock injection.The test correctly initializes the mock controller, creates a mock RDMA provider, and injects it into the Host instance via
SetRdmaProvider. TheAfterEachproperly callsFinish()to verify expectations.
593-627: LGTM! Good coverage ofGetRDMADeviceForPCIscenarios.Tests cover:
- Multiple RDMA devices returned
- Empty list converted to nil
- Non-RDMA-capable device returning nil
These align with the production code logic in
host.go.
629-662: LGTM!VerifyRDMACapabilitytests appropriately verify boolean logic.
664-744: LGTM! ComprehensiveGetRDMACharDevicestest coverage.Tests cover full device lists, partial lists, empty returns (nil conversion), and differentiation across multiple RDMA devices. Well-structured test cases.
pkg/host/mock/mock_rdma_provider.go (1)
1-68: LGTM! Auto-generated mock implementation.This is standard
mockgen-generated code matching theRdmaProviderinterface. No manual review needed for generated code.pkg/host/host.go (4)
113-118: LGTM! Clean interface extension for RDMA operations.The new RDMA methods follow the existing interface patterns and provide a consistent API for RDMA device discovery and capability verification.
120-132: LGTM! Host struct extended with RDMA provider dependency.The
rdmaProviderfield and its initialization inNewHost()follow dependency injection patterns, enabling testability.
153-157: LGTM! Test injection method for RDMA provider.Clear documentation that this is primarily for unit testing. An alternative would be functional options in
NewHost(), but the current approach is pragmatic and sufficient.
802-816: LGTM!VerifyRDMACapabilitycorrectly wraps device lookup.Simple boolean wrapper that properly propagates errors and checks device presence.
pkg/host/rdma_provider.go (1)
27-47: LGTM! Clean abstraction layer for RDMA device discovery.The
RdmaProviderinterface provides a thin wrapper around therdmamaplibrary, enabling testability through mock injection. The interface methods correctly mirror the underlying library's API, which returns empty slices on failure rather than error values.
a007b52 to
354658e
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (10)
pkg/devicestate/state_test.go (3)
90-118: Tests exercise mocks directly rather than production code.The test sets up mock expectations and then calls
mockHost.GetRDMADeviceForPCI()andmockHost.GetRDMACharDevices()directly. This verifies the mock works but doesn't test any production logic. Consider testing the actualapplyConfigOnDevicefunction or similar production code path that uses these host methods.
120-136: Test validates manually constructed expectations without exercising production code.This test constructs the expected environment variables and then asserts they exist in the map that was just created. To provide value, consider calling the production code that generates these environment variables and validating the output matches expectations.
236-251: Test validates self-fulfilling assertions.The test verifies that
/dev/infiniband/uverbs0contains"uverbs", which is trivially true. To add value, consider testing the production code's switch statement logic fromstate.golines 256-265 that identifies device types from paths.pkg/devicestate/state.go (3)
256-265: Consider more precise device type matching.Using
strings.Containscould match incorrectly in edge cases. For example, a hypothetical path containing both substrings. Consider usingfilepath.Baseand prefix matching for more robust identification:🔎 Suggested improvement
+ baseName := filepath.Base(charDev) switch { - case strings.Contains(charDev, "uverbs"): + case strings.HasPrefix(baseName, "uverbs"): envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UVERBS=%s", devicePrefix, rdmaDeviceSanitized, charDev)) - case strings.Contains(charDev, "umad"): + case strings.HasPrefix(baseName, "umad"): envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UMAD=%s", devicePrefix, rdmaDeviceSanitized, charDev)) - case strings.Contains(charDev, "issm"): + case strings.HasPrefix(baseName, "issm"): envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_ISSM=%s", devicePrefix, rdmaDeviceSanitized, charDev)) - case strings.Contains(charDev, "rdma_cm"): + case baseName == "rdma_cm": envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_CM=%s", devicePrefix, rdmaDeviceSanitized, charDev)) }
268-269: Consider consistent log levels.Line 268 uses
logger.Infowhile similar messages (lines 224-225, 237-239) uselogger.V(2).Info. For consistency and to avoid verbose production logs, consider usingV(2)here as well.🔎 Suggested change
- logger.Info("Added RDMA character devices for device", + logger.V(2).Info("Added RDMA character devices for device", "device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices)
246-266: Deduplicate shared RDMA character devices in CDI spec.The code appends character devices without deduplication. When multiple RDMA devices (e.g.,
mlx5_5andmlx5_6) are present, shared devices like/dev/infiniband/rdma_cmwill be added multiple times. Consider deduplicatingdeviceNodesby path before appending, using a set or checking for existing entries to prevent redundant device node entries.pkg/host/host_test.go (2)
582-587: Type assertion without safety check.The type assertion at line 585 will panic if
NewHost()returns a type other than*host.Host. While this is unlikely in practice, consider using a safe assertion or verifying the test setup:🔎 Suggested improvement
BeforeEach(func() { mockCtrl = gomock.NewController(GinkgoT()) mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl) - hostImpl = host.NewHost().(*host.Host) + var ok bool + hostImpl, ok = host.NewHost().(*host.Host) + Expect(ok).To(BeTrue(), "NewHost() should return *Host") hostImpl.SetRdmaProvider(mockRdmaProvider) })
606-626: Duplicate test cases.The tests "should return nil when no RDMA devices exist" and "should return nil when device is not RDMA capable" have identical mock setups and assertions. From the production code's perspective, both scenarios result in an empty list from the provider. Consider consolidating into a single test or differentiating the scenarios if there's intended behavioral difference.
pkg/host/host.go (2)
797-799: Consider V(2) log level for routine RDMA device discovery.The
Infolevel log at line 797 will output for every RDMA-capable device during normal operation. Consider usingV(2).Infofor consistency with the "not found" case at line 793.🔎 Suggested change
- h.log.Info("GetRDMADeviceForPCI(): found RDMA devices", + h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices", "pciAddress", pciAddr, "rdmaDevices", rdmaDevices)
832-834: Same log level consideration as GetRDMADeviceForPCI.For consistency, consider using
V(2).Infohere as well.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/devicestate/state.go(1 hunks)pkg/devicestate/state_test.go(2 hunks)pkg/host/host.go(3 hunks)pkg/host/host_test.go(2 hunks)pkg/host/mock/mock_host.go(2 hunks)pkg/host/mock/mock_rdma_provider.go(1 hunks)pkg/host/rdma_provider.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/host/mock/mock_host.go
- pkg/host/mock/mock_rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/host_test.gopkg/host/host.gopkg/host/rdma_provider.go
🧬 Code graph analysis (3)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
AttributeRDMACapable(40-40)pkg/host/host.go (1)
GetHelpers(148-151)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Container
- GitHub Check: Lint
🔇 Additional comments (8)
pkg/devicestate/state_test.go (1)
171-201: Improved test for non-RDMA device handling.The test now directly validates the conditional logic that determines whether RDMA preparation should occur, which is an improvement over the previous version. The condition at line 194 correctly mirrors the production code condition.
pkg/devicestate/state.go (1)
217-276: RDMA integration looks well-structured.The RDMA handling block correctly:
- Checks for RDMA capability before proceeding
- Handles errors gracefully without failing device preparation
- Includes RDMA device name in env var keys to prevent collisions with multiple RDMA devices
- Adds character devices to CDI spec for container access
pkg/host/host_test.go (1)
664-744: Comprehensive test coverage for GetRDMACharDevices.Good coverage of edge cases including:
- Full device list return
- Partial device list return
- Empty/nil returns
- Multiple RDMA devices with different character devices
pkg/host/rdma_provider.go (2)
27-42: Clean abstraction enabling testability.The
RdmaProviderinterface properly wraps the externalrdmamaplibrary, enabling mock injection for unit tests. The statelessdefaultRdmaProviderstruct appropriately uses value receivers.
19-21: Consider light maintenance status ofgithub.com/Mellanox/rdmamapdependency.The implementation depends on
github.com/Mellanox/rdmamap(v1.1.0), which is stable but has minimal ongoing maintenance (last release November 23, 2022). While the library is production-suitable, be aware that this lightly-maintained dependency may have limited community support for future issues or compatibility updates.pkg/host/host.go (3)
113-118: Clean interface extension for RDMA functionality.The new RDMA methods follow the existing interface patterns and provide a clear API for RDMA device discovery and capability verification.
153-157: Good design for test injection.Exposing
SetRdmaProvideron the concrete*Hosttype rather than theInterfacekeeps the main API clean while enabling mock injection in tests.
802-816: LGTM.
VerifyRDMACapabilitycorrectly delegates toGetRDMADeviceForPCIand properly handles potential future error cases.
354658e to
cb91e4d
Compare
|
Get pod names POD1=$(kubectl get pod -l app=sriov-rdma-test -o jsonpath='{.items[0].metadata.name}')
POD2=$(kubectl get pod -l app=sriov-rdma-test -o jsonpath='{.items[1].metadata.name}')
echo "Pod 1: $POD1"
echo "Pod 2: $POD2"
Pod 1: sriov-rdma-test-4hdhd
Pod 2: sriov-rdma-test-gnbsrCheck RDMA environment variables kubectl exec $POD1 -- env | grep RDMA
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_UMAD=/dev/infiniband/umad3
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_UVERBS=/dev/infiniband/uverbs3
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_CM=/dev/infiniband/rdma_cm
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_DEVICE=mlx5_3Check RDMA devices kubectl exec $POD1 -- ls -la /dev/infiniband/
kubectl exec $POD2 -- ls -la /dev/infiniband/
total 0
drwxr-xr-x 2 root root 100 Dec 21 13:42 .
drwxr-xr-x 6 root root 380 Dec 21 13:42 ..
crw-rw-rw- 1 root root 10, 122 Dec 21 13:42 rdma_cm
crw-rw-rw- 1 root root 231, 3 Dec 21 13:42 umad3
crw-rw-rw- 1 root root 231, 195 Dec 21 13:42 uverbs3
total 0
drwxr-xr-x 2 root root 100 Dec 21 13:42 .
drwxr-xr-x 6 root root 380 Dec 21 13:42 ..
crw-rw-rw- 1 root root 10, 122 Dec 21 13:42 rdma_cm
crw-rw-rw- 1 root root 231, 3 Dec 21 13:42 umad3
crw-rw-rw- 1 root root 231, 195 Dec 21 13:42 uverbs3 |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/host/host.go (1)
784-835: Consider adding input validation for RDMA methods.The RDMA device methods work correctly but could benefit from basic input validation:
GetRDMADeviceForPCIandVerifyRDMACapability: validate PCI address format (e.g., check for domain:bus:device.function pattern)GetRDMACharDevices: validate rdmaDeviceName is non-emptyThis would provide earlier error detection and clearer failure messages for malformed inputs, though the current implementation will handle invalid inputs gracefully by returning empty results.
Example validation approach
func (h *Host) GetRDMADeviceForPCI(pciAddr string) ([]string, error) { + if pciAddr == "" { + return nil, fmt.Errorf("pciAddr cannot be empty") + } + // Optional: validate PCI address format (DDDD:BB:DD.F) + if len(strings.Split(pciAddr, ":")) != 3 { + return nil, fmt.Errorf("invalid PCI address format: %s", pciAddr) + } h.log.V(2).Info("GetRDMADeviceForPCI(): getting RDMA devices for PCI address", "device", pciAddr)func (h *Host) GetRDMACharDevices(rdmaDeviceName string) ([]string, error) { + if rdmaDeviceName == "" { + return nil, fmt.Errorf("rdmaDeviceName cannot be empty") + } h.log.V(2).Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName)
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/devicestate/state.go(1 hunks)pkg/devicestate/state_test.go(2 hunks)pkg/host/host.go(3 hunks)pkg/host/host_test.go(2 hunks)pkg/host/mock/mock_host.go(2 hunks)pkg/host/mock/mock_rdma_provider.go(1 hunks)pkg/host/rdma_provider.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/devicestate/state_test.go
- pkg/host/mock/mock_host.go
- pkg/devicestate/state.go
- pkg/host/host_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/host.gopkg/host/mock/mock_rdma_provider.gopkg/host/rdma_provider.go
🧬 Code graph analysis (1)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Container
- GitHub Check: Lint
🔇 Additional comments (5)
pkg/host/rdma_provider.go (1)
23-47: LGTM! Clean abstraction for RDMA provider.The RdmaProvider interface and default implementation follow best practices for dependency injection and testability. The delegation to the rdmamap library is straightforward, and the factory function appropriately returns the interface type.
pkg/host/mock/mock_rdma_provider.go (1)
1-68: Generated mock implementation is correct.This gomock-generated mock correctly implements the RdmaProvider interface with the expected methods and recorder pattern.
pkg/host/host.go (3)
114-117: Well-designed RDMA interface additions.The three new methods provide a clear and consistent API for RDMA device discovery and verification. The naming follows the existing convention and the separation of concerns (device lookup vs. capability verification) is appropriate.
122-132: Proper initialization of RDMA provider.The rdmaProvider field is correctly initialized in NewHost() and uses the interface type for testability. The initialization pattern is consistent with the rest of the codebase.
153-157: Good test injection mechanism.SetRdmaProvider enables clean dependency injection for unit tests. The method is well-documented and follows a common pattern for test seams.
cb91e4d to
d3c8cdb
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/devicestate/state_test.go (1)
49-81: Test validates boolean logic but doesn't exercise production code.The test replicates the conditional expression (line 73) but never invokes the actual production function (e.g.,
applyConfigOnDevice). This only verifies that Go evaluates boolean expressions correctly, not that your RDMA preparation logic behaves as expected.Consider updating the test to actually call the production code path with a non-RDMA device and verify that RDMA-related methods (e.g.,
GetRDMADeviceForPCI,GetRDMACharDevices) are not invoked. This would provide meaningful coverage of the skip logic.🔎 Suggested approach
Instead of duplicating the conditional logic:
- Set up a mock host with expectations that RDMA methods should not be called
- Call
applyConfigOnDeviceor the relevant production function with a non-RDMA device- Verify the mock expectations (that RDMA methods were not called)
- Assert the returned
PreparedDevicedoes not contain RDMA device nodes or environment variablesThis would test the actual behavior rather than the boolean expression.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/devicestate/state.go(1 hunks)pkg/devicestate/state_test.go(2 hunks)pkg/host/host.go(3 hunks)pkg/host/host_test.go(2 hunks)pkg/host/mock/mock_host.go(2 hunks)pkg/host/mock/mock_rdma_provider.go(1 hunks)pkg/host/rdma_provider.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/host/mock/mock_host.go
- pkg/host/mock/mock_rdma_provider.go
- pkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/host_test.gopkg/host/host.go
🧬 Code graph analysis (3)
pkg/devicestate/state_test.go (1)
pkg/consts/consts.go (2)
AttributePciAddress(32-32)AttributeRDMACapable(40-40)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
pkg/host/host.go (2)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)pkg/cni/interface.go (1)
Interface(30-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Container
- GitHub Check: Lint
🔇 Additional comments (4)
pkg/devicestate/state.go (1)
217-276: RDMA integration looks solid; collision issue resolved.The RDMA handling correctly addresses the previous environment variable collision concern by including both
devicePrefixandrdmaDeviceSanitizedin environment variable keys (lines 258-264, 272-273), ensuring multiple RDMA devices per PCI address won't overwrite each other's variables.Error handling appropriately continues on failure (lines 231-234, 236-240), and the integration follows the established pattern for VFIO and vhost device handling.
pkg/host/host_test.go (1)
593-766: Comprehensive RDMA test coverage.The test suite thoroughly covers:
- Happy paths for RDMA device discovery and character device retrieval
- Edge cases (empty lists, non-RDMA devices)
- Input validation (empty/invalid PCI addresses, empty device names)
- Multi-device scenarios to ensure different devices return different character devices
The gomock-based approach properly isolates the RDMA provider for focused unit testing.
pkg/host/host.go (2)
153-157: LGTM - Test injection pattern.The
SetRdmaProvidermethod appropriately enables mock injection for unit testing while documenting its primary purpose. This is a clean approach to dependency injection.
784-849: RDMA functions are well-structured with appropriate logging and error handling.The three RDMA methods follow consistent patterns:
- Input validation for empty/malformed parameters
- Delegation to the
rdmaProviderabstraction- Appropriate return of
nilfor "not found" vs. errors for invalid input- Logging at suitable verbosity levels (V(2) for debug, Info for discoveries)
The implementations correctly handle the case where RDMA devices don't exist by returning
nilrather than errors, which aligns with the optional nature of RDMA capability.
| BeforeEach(func() { | ||
| mockCtrl = gomock.NewController(GinkgoT()) | ||
| mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl) | ||
| hostImpl = host.NewHost().(*host.Host) | ||
| hostImpl.SetRdmaProvider(mockRdmaProvider) | ||
| }) |
There was a problem hiding this comment.
Type assertion could panic without check.
Line 585 performs an unchecked type assertion that would panic if NewHost() returns an unexpected implementation. While this works in practice since NewHost() returns *Host, a safer approach would validate the assertion.
🔎 Safer type assertion
- hostImpl = host.NewHost().(*host.Host)
+ var ok bool
+ hostImpl, ok = host.NewHost().(*host.Host)
+ Expect(ok).To(BeTrue(), "NewHost() should return *host.Host")🤖 Prompt for AI Agents
In pkg/host/host_test.go around lines 582 to 587, the unchecked type assertion
host.NewHost().(*host.Host) can panic if NewHost() returns a different
implementation; replace it with a safe comma-ok assertion (or type switch) and
fail the test if the assertion is false. Specifically, assign the result to an
interface variable, do v, ok := result.(*host.Host), use Ginkgo/Gomega (e.g.,
Expect(ok).To(BeTrue()) or Failf) to stop the test on mismatch, and only call
v.SetRdmaProvider(...) when ok is true.
54c02d4 to
0958b26
Compare
SchSeba
left a comment
There was a problem hiding this comment.
nice!
I reviewed only the second commit as the first one is implemented in the other PR
pkg/devicestate/state.go
Outdated
| } | ||
|
|
||
| // If device is RDMA capable, add RDMA character devices | ||
| if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue { |
There was a problem hiding this comment.
can we move this into a function that returns the mounts?
it will be good to move the others also into functions but that can be done in a separate PR :)
pkg/devicestate/state.go
Outdated
| if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue { | ||
| rdmaDevices, err := host.GetHelpers().GetRDMADeviceForPCI(pciAddress) | ||
| if err != nil { | ||
| logger.Error(err, "Failed to get RDMA devices for PCI address", |
There was a problem hiding this comment.
I think we should return an error here and not let the pod start without the RDMA devices.
if the VF is reported as RDMA this means we manage to run the GetRDMADeviceForPCI in the discovery.
pkg/devicestate/state.go
Outdated
| if err != nil { | ||
| logger.Error(err, "Failed to get RDMA devices for PCI address", | ||
| "device", pciAddress) | ||
| } else if len(rdmaDevices) > 0 { |
There was a problem hiding this comment.
when we move into a function we can remove the else if
pkg/devicestate/state.go
Outdated
| if err != nil { | ||
| logger.Error(err, "Failed to get RDMA character devices, skipping", | ||
| "device", pciAddress, "rdmaDevice", rdmaDevice) | ||
| continue |
There was a problem hiding this comment.
is it possible for a device to report RDMA capable and not having the char devices?
In general I don't want to be in a place where the user request RDMA and at the end we allow the pod to start without the needed mounts, meaning RDMA will not really work for them.
pkg/devicestate/state.go
Outdated
|
|
||
| // Add environment variable for each character device type | ||
| // Include RDMA device name to avoid collisions with multiple RDMA devices | ||
| switch { |
There was a problem hiding this comment.
we can leave this one for now, but I am not really sure someone will really use the ENVs.
pkg/devicestate/state.go
Outdated
| "device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices) | ||
|
|
||
| // Add RDMA device name to environment variables | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_DEVICE=%s", |
There was a problem hiding this comment.
can we have a new lets of envs for RDMA and add it into the log on line (268) 0958b26#diff-c715c6df7a2299a8c29472a78a7ca23f6bc19143e636875e54a0917b0a443ba3R268
after that we can append
| }) | ||
| }) | ||
|
|
||
| Context("RDMA Device Preparation", func() { |
There was a problem hiding this comment.
please add a test to validate the mounts and the envs exist
pkg/host/host.go
Outdated
| charDevices := h.rdmaProvider.GetRdmaCharDevices(rdmaDeviceName) | ||
|
|
||
| if len(charDevices) == 0 { | ||
| h.log.V(2).Info("GetRDMACharDevices(): no character devices found", "rdmaDevice", rdmaDeviceName) |
There was a problem hiding this comment.
please remove from debug logs
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/devicestate/state_test.go (1)
54-86: Test validates attributes but doesn't exercise production code path.This test creates a device with
rdmaCapable=falseand then validates the conditional logic in comments (lines 75-78), but it doesn't actually invoke the production code that uses this logic. While the test correctly verifies that theshouldPrepareRDMAcondition evaluates tofalse, it would be more valuable to call the actual device preparation function with this non-RDMA device to verify that RDMA preparation is truly skipped in the production code path.💡 Consider enhancing the test
Consider invoking the actual device preparation/state management function that uses this conditional logic, so the test verifies the production behavior rather than just the condition evaluation:
shouldPrepareRDMA := exists && rdmaCapable.BoolValue != nil && *rdmaCapable.BoolValue Expect(shouldPrepareRDMA).To(BeFalse(), "RDMA preparation should be skipped for non-RDMA capable devices") - // When this condition is false, the production code never calls: - // - GetRDMADeviceForPCI - // - GetRDMACharDevices - // This test verifies the condition evaluates correctly for non-RDMA devices + // Actually invoke the production function to verify RDMA preparation is skipped + // (Add appropriate function call here based on your production API) })pkg/devicestate/state.go (1)
329-330: Consider usingV(2).Infofor per-device logging to reduce noise.This log statement runs for every RDMA device processed. Using
logger.V(2).Infowould be more consistent with similar logging patterns in the codebase (e.g., line 285) and reduce log noise in production environments with many RDMA devices.🔎 Suggested change
- logger.Info("Added RDMA character devices for device", + logger.V(2).Info("Added RDMA character devices for device", "device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices, "envs", envs)pkg/host/host.go (1)
811-812: Consider usingV(2).Infofor RDMA discovery result logging.The "getting" operations log at
V(2)level (lines 802, 818, 841), but the result logging usesInfolevel. During device discovery with many RDMA devices, this could produce verbose output. Consider usingV(2).Infofor consistency with similar discovery patterns elsewhere in the codebase.🔎 Suggested changes
- h.log.Info("GetRDMADeviceForPCI(): found RDMA devices", + h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices", "pciAddress", pciAddr, "rdmaDevices", rdmaDevices)- h.log.Info("VerifyRDMACapability(): RDMA capability check result", + h.log.V(2).Info("VerifyRDMACapability(): RDMA capability check result", "device", pciAddr, "rdmaCapable", hasRDMA)- h.log.Info("GetRDMACharDevices(): found character devices", + h.log.V(2).Info("GetRDMACharDevices(): found character devices", "rdmaDevice", rdmaDeviceName, "charDevices", charDevices)Also applies to: 826-827, 851-852
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
go.modpkg/consts/consts.gopkg/devicestate/discovery.gopkg/devicestate/discovery_test.gopkg/devicestate/state.gopkg/devicestate/state_test.gopkg/host/host.gopkg/host/host_test.gopkg/host/mock/mock_host.gopkg/host/mock/mock_rdma_provider.gopkg/host/rdma_provider.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/consts/consts.go
- pkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-10T11:30:12.071Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:12.071Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type returns numeric format like "1: CA" not plain text "CA". The numeric prefixes are: 1 = CA (Channel Adapter - InfiniBand or RoCE), 4 = RNIC (iWARP). Code should use strings.HasPrefix() to parse these values.
Applied to files:
pkg/devicestate/discovery.go
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/mock/mock_host.gopkg/host/host.gopkg/host/mock/mock_rdma_provider.gopkg/host/host_test.go
🧬 Code graph analysis (4)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
MockInterface(23-27)MockInterfaceMockRecorder(30-32)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
AttributeRDMACapable(40-40)pkg/host/host.go (1)
GetHelpers(148-151)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint
- GitHub Check: Unit Tests
- GitHub Check: Build Container
🔇 Additional comments (23)
pkg/devicestate/discovery_test.go (4)
81-82: LGTM! RDMA capability checks are consistently added.The mock expectations for
VerifyRDMACapabilityare properly added for each VF across all test scenarios, ensuring comprehensive coverage of the RDMA capability detection flow.Also applies to: 154-154, 156-156, 205-205, 242-242, 420-420, 550-550
211-213: NUMA semantic change is intentional and well-documented.The NUMA node default has been changed from 0 to -1 to indicate "not supported" when NUMA detection fails. This aligns with standard Linux conventions and is properly documented in the comments.
255-321: Excellent test coverage for RDMA-capable VFs.The test thoroughly validates:
- Mixed RDMA-capable and non-RDMA-capable VFs from the same PF
- Correct attribute population for RDMA-capable devices
- Proper handling of Mellanox-specific device IDs
323-357: Good error handling test for RDMA capability checks.The test correctly verifies that RDMA capability check failures default to
falserather than causing the discovery process to fail, ensuring graceful degradation.pkg/devicestate/discovery.go (4)
79-84: NUMA handling updated correctly.The change to default NUMA node to
-1(not supported) instead of a fabricated value is correct and aligns with standard Linux conventions. The comment clearly explains the semantics.
136-145: NUMA parsing with proper error handling.The NUMA node parsing correctly handles errors by defaulting to
-1and includes informative logging. The use ofptr.To(numaNodeInt)is appropriate for the int pointer attribute.
151-157: RDMA capability check with graceful error handling.The per-VF RDMA capability verification correctly defaults to
falseon error, ensuring the discovery process continues even if RDMA checks fail. The error is logged appropriately.
169-209: Attributes map construction is clean and maintainable.The refactoring to build attributes in a map before creating the device resource improves readability and makes it easier to add new attributes in the future. All attributes including the new RDMA capability flag are properly populated.
pkg/devicestate/state_test.go (3)
113-151: Comprehensive test for RDMA device handling.The test properly validates:
- Device nodes for all RDMA character device types (uverbs, umad, issm, rdma_cm)
- Environment variable naming and values
- Correct mock expectations
The assertions verify both the count and content of device nodes and environment variables.
153-183: Good coverage for multiple RDMA devices scenario.The test correctly validates that multiple RDMA devices (mlx5_0, mlx5_1) each generate their own device nodes and environment variables with proper naming conventions.
185-259: Excellent error path coverage.All critical error scenarios are tested:
- GetRDMADeviceForPCI failure
- No RDMA devices found (empty list)
- GetRDMACharDevices failure
- No character devices found
- Empty RDMA device name validation
Each test properly verifies error propagation and nil returns.
pkg/host/host_test.go (3)
593-662: Excellent PCI address validation coverage.The tests thoroughly validate:
- Valid RDMA device returns
- Empty device lists returning nil
- Empty PCI address validation
- Multiple invalid format scenarios (no colons, too few parts, missing dot, too many parts)
The error messages are descriptive and helpful for debugging.
664-697: VerifyRDMACapability tests are comprehensive.The tests correctly validate:
- True when RDMA devices exist
- False when device list is empty
- Proper error handling
The distinction between "no capabilities" and "empty list" scenarios ensures robustness.
699-786: GetRDMACharDevices tests cover all critical scenarios.Excellent test coverage including:
- Complete set of character devices (uverbs, umad, issm, rdma_cm)
- Partial device sets
- Empty results returning nil
- Multiple RDMA devices with different character devices
- Empty device name validation
The tests properly use
ContainElementsfor unordered matching andNotTo(Equal())for differentiation.go.mod (1)
6-6: v1.1.0 is the latest stable version with no known security vulnerabilities.Verification confirms that v1.1.0 (released August 11, 2022) is the current and latest stable release of Mellanox/rdmamap. No security advisories or CVEs have been published for this Go library. The repository is lightly maintained with only 3 releases total, but the dependency is suitable for use.
pkg/host/mock/mock_host.go (1)
190-218: LGTM - Generated mock methods follow consistent patterns.The new RDMA-related mock methods (
GetRDMACharDevices,GetRDMADeviceForPCI,VerifyRDMACapability) are correctly generated and follow the same gomock patterns as existing methods in the file.Also applies to: 377-391
pkg/devicestate/state.go (2)
218-226: LGTM - RDMA capability check integrates well with existing flow.The RDMA handling is properly gated behind the
AttributeRDMACapableboolean check, and errors propagate correctly to prevent pod startup without required RDMA devices.
267-338: LGTM - handleRDMADevice correctly enforces RDMA device availability.The function properly:
- Returns errors when RDMA devices or character devices are not found (addressing past review feedback)
- Uses unique environment variable keys per RDMA device to avoid collisions
- Builds CDI device nodes for all RDMA character devices
pkg/host/mock/mock_rdma_provider.go (1)
1-68: LGTM - Correctly generated mock for RdmaProvider interface.The mock implementation follows standard gomock patterns and correctly matches the
RdmaProviderinterface defined inrdma_provider.go.pkg/host/host.go (4)
113-117: LGTM - Interface additions for RDMA functionality.The new RDMA methods are well-defined and follow the existing interface patterns.
121-131: LGTM - Host struct and initialization updated for RDMA support.The
rdmaProviderfield andSetRdmaProvidermethod enable proper dependency injection for testing.Also applies to: 153-157
786-814: LGTM - Robust PCI address validation.The validation correctly enforces the
DDDD:BB:DD.Fformat by checking for 3 colon-separated parts and a dot-separated device.function suffix.
816-830: LGTM - VerifyRDMACapability and GetRDMACharDevices implementations.Both functions correctly delegate to the rdmaProvider and handle empty results appropriately.
Also applies to: 832-854
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/devicestate/state_test.go (1)
96-111: Unnecessaryhost.GetHelpers()call on line 100.The call
_ = host.GetHelpers()appears to force initialization of the global helpers, but the result is immediately discarded. SinceorigHelpers = host.Helperson line 101 captures the current value anyway, this call may be redundant unless there's a specific initialization side effect needed.🔎 Consider removing the unused call
BeforeEach(func() { mockCtrl = gomock.NewController(GinkgoT()) mockHost = mock_host.NewMockInterface(mockCtrl) // Save original helpers and replace with mock - _ = host.GetHelpers() + host.GetHelpers() // Ensure helpers are initialized before capturing origHelpers = host.Helpers host.Helpers = mockHostOr if the initialization is truly not needed:
BeforeEach(func() { mockCtrl = gomock.NewController(GinkgoT()) mockHost = mock_host.NewMockInterface(mockCtrl) // Save original helpers and replace with mock - _ = host.GetHelpers() origHelpers = host.Helpers host.Helpers = mockHostpkg/devicestate/state.go (1)
288-327: Potential duplicate device nodes for shared character devices likerdma_cm.The
/dev/infiniband/rdma_cmdevice is typically a single shared device across all RDMA devices on the system. When iterating over multiple RDMA devices, this same path could be added todeviceNodesmultiple times, resulting in duplicate entries in the CDI spec.While CDI may handle duplicates gracefully, consider deduplicating device nodes or tracking already-added paths.
🔎 Suggested deduplication approach
func (s *Manager) handleRDMADevice(ctx context.Context, pciAddress, deviceName string) ([]*cdispec.DeviceNode, []string, error) { logger := klog.FromContext(ctx).WithName("handleRDMADevice") var deviceNodes []*cdispec.DeviceNode var envs []string + addedDevicePaths := make(map[string]bool) // ... existing code ... // Add each character device to the CDI spec for _, charDev := range charDevices { + // Avoid duplicate device nodes (e.g., rdma_cm is shared) + if addedDevicePaths[charDev] { + continue + } + addedDevicePaths[charDev] = true + deviceNodes = append(deviceNodes, &cdispec.DeviceNode{ Path: charDev, HostPath: charDev, Type: "c", // character device })Could you verify if
/dev/infiniband/rdma_cmis indeed shared across RDMA devices and whether duplicate entries in CDI specs cause any runtime issues?pkg/host/host.go (1)
835-854: Consider adjusting log levels for RDMA character device lookups.The
GetRDMACharDevicesfunction logs atInfolevel (lines 841, 847, 851-852) for every call. Since this is invoked per RDMA device during container setup, it could generate significant log volume. Consider usingV(2).Infofor the entry log (line 841) to align with other similar methods.🔎 Suggested log level adjustment
- h.log.Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName) + h.log.V(2).Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName)The "found character devices" log (line 851) at Info level is appropriate for visibility of successful RDMA setup.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
pkg/devicestate/state.gopkg/devicestate/state_test.gopkg/host/host.gopkg/host/host_test.gopkg/host/mock/mock_host.gopkg/host/mock/mock_rdma_provider.gopkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.
Applied to files:
pkg/host/mock/mock_host.gopkg/host/host.gopkg/host/host_test.gopkg/host/rdma_provider.gopkg/host/mock/mock_rdma_provider.go
🧬 Code graph analysis (5)
pkg/devicestate/state_test.go (2)
pkg/consts/consts.go (2)
AttributePciAddress(32-32)AttributeRDMACapable(40-40)pkg/host/host.go (3)
Interface(74-118)GetHelpers(148-151)Helpers(136-136)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
AttributeRDMACapable(40-40)pkg/host/host.go (1)
GetHelpers(148-151)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
MockInterface(23-27)MockInterfaceMockRecorder(30-32)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
RdmaProvider(27-30)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
MockRdmaProvider(19-23)NewMockRdmaProvider(31-35)pkg/host/host.go (2)
Host(121-124)NewHost(127-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Unit Tests
- GitHub Check: Lint
- GitHub Check: Build Container
🔇 Additional comments (9)
pkg/host/rdma_provider.go (1)
1-47: Clean abstraction for RDMA provider dependency injection.The interface design is well-structured for testability. The wrapper pattern around
rdmamaplibrary allows easy mocking in unit tests while keeping the production implementation thin.Minor observation: The interface methods don't return errors, which aligns with the underlying
rdmamaplibrary signatures. Callers should handle empty slice returns as "not found" conditions.pkg/devicestate/state_test.go (1)
113-259: Comprehensive test coverage forhandleRDMADevice.The test suite thoroughly covers:
- Single RDMA device with all character device types (uverbs, umad, issm, rdma_cm)
- Multiple RDMA devices per PCI address
- Error propagation from
GetRDMADeviceForPCIandGetRDMACharDevices- Edge cases: empty device lists
The environment variable format validation (e.g.,
SRIOVNETWORK_device_1_mlx50_RDMA_UVERBS) confirms the sanitization logic works correctly with underscores and hyphens.pkg/host/mock/mock_host.go (1)
190-218: Generated mock methods correctly match the interface.The new RDMA-related mock methods (
GetRDMACharDevices,GetRDMADeviceForPCI,VerifyRDMACapability) are correctly generated and align with theInterfacedefinition inhost.go. No issues with the generated code.Also applies to: 378-391
pkg/devicestate/state.go (2)
218-226: RDMA integration point is correctly placed.The RDMA handling is appropriately integrated after VFIO and vhost processing. The condition check correctly validates the RDMA capability attribute before invoking
handleRDMADevice, and errors are propagated to prevent pod startup with incomplete RDMA configuration.
329-335: LGTM - Environment variable generation follows the established pattern.The RDMA device name is correctly appended to environment variables after processing all character devices for each RDMA device, maintaining consistency with the naming convention.
pkg/host/host_test.go (1)
593-786: Excellent test coverage for RDMA device functions.The test suite comprehensively validates:
- Successful RDMA device and character device retrieval
- Nil returns for non-RDMA-capable devices
- Input validation for empty and malformed PCI addresses
- PCI address format validation (DDDD:BB:DD.F)
- RDMA capability verification logic
The tests correctly verify that empty results return
nilrather than empty slices, matching the production implementation.pkg/host/mock/mock_rdma_provider.go (1)
1-68: Generated mock correctly implements RdmaProvider interface.The mock is properly generated from
rdma_provider.goand follows gomock conventions. The methods correctly return[]stringwithout error, matching the interface definition.pkg/host/host.go (2)
113-118: Clean interface extension and dependency injection support.The RDMA methods are well-placed within the Interface definition, and
SetRdmaProviderprovides a clean mechanism for injecting mock providers in tests without modifying the global Helpers pattern.Also applies to: 153-157
786-814: PCI address validation is correctly strict.The validation now properly requires the full
DDDD:BB:DD.Fformat with three colon-separated parts and a dot-separated device.function, addressing previous review feedback.Returning
nil, nilfor empty results (lines 807-809) is appropriate since "no RDMA devices" is not an error condition at this layer - callers handle this based on their requirements.
pkg/devicestate/state.go
Outdated
| } | ||
|
|
||
| // If device is RDMA capable, add RDMA character devices | ||
| if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue { |
There was a problem hiding this comment.
nit: i would move this long check into handleRDMADevice()
it can return early (with empty list), then always append deviceNodes and envs
pkg/devicestate/state.go
Outdated
| return nil, fmt.Errorf("error handling RDMA device: %w", err) | ||
| } | ||
| deviceNodes = append(deviceNodes, rdmaDeviceNodes...) | ||
| envs = append(envs, rdmaEnvs...) |
There was a problem hiding this comment.
this is a general question about the environment variables we add per device.
with current implementation we have a bunch for each PCI device prepared which contains information about the various mounts (vfio, tun, vhost, rdma) as well as net-attach-def config name and the pci address.
should we have a single environment variable per device with a json map like we have in device plugin ?
e.g: SRIOVNETWORK_03_01_00_2_INFO={"rdma": {...}, "vfio": {....}, "vhost": {...}}
for this PR i think we need to keep the current style, then if we agree, align it all to a different format in a different PR.
pkg/devicestate/state.go
Outdated
| logger.V(2).Info("Device is RDMA capable, adding RDMA character devices", | ||
| "device", pciAddress, "rdmaDevices", rdmaDevices) | ||
|
|
||
| for _, rdmaDevice := range rdmaDevices { |
There was a problem hiding this comment.
we expect to have exactly one RDMA device, i think we should check and fail if more than one.
pkg/devicestate/state.go
Outdated
| } | ||
|
|
||
| // Use RDMA device name in env var key to support multiple RDMA devices | ||
| rdmaDeviceSanitized := strings.ReplaceAll(rdmaDevice, "_", "") |
There was a problem hiding this comment.
we expect to have 1 rdma device per PCI device so IMO we dont need it part of the env var name
pkg/devicestate/state.go
Outdated
| // Include RDMA device name to avoid collisions with multiple RDMA devices | ||
| switch { | ||
| case strings.Contains(charDev, "uverbs"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UVERBS=%s", devicePrefix, rdmaDeviceSanitized, charDev)) |
c1923aa to
e857b30
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces RDMA device support, including discovery and injection into containers via CDI. The changes are well-structured, with a new handleRDMADevice function to encapsulate the logic and comprehensive unit tests to cover various scenarios. The integration with the host helper interface and the use of a dedicated RDMA provider are good design choices. I have one suggestion to improve the robustness of character device type detection.
| switch { | ||
| case strings.Contains(charDev, "uverbs"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev)) | ||
| case strings.Contains(charDev, "umad"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev)) | ||
| case strings.Contains(charDev, "issm"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev)) | ||
| case strings.Contains(charDev, "rdma_cm"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev)) | ||
| } |
There was a problem hiding this comment.
Using strings.Contains to identify character device types could be brittle. For example, if a device path were /dev/infiniband/my-uverbs-device, it would incorrectly match as a uverbs device. A more robust approach is to check the prefix of the device file's base name. For rdma_cm, an exact match is even better as it's a singleton device.
Note: This change requires importing the path/filepath package.
| switch { | |
| case strings.Contains(charDev, "uverbs"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev)) | |
| case strings.Contains(charDev, "umad"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev)) | |
| case strings.Contains(charDev, "issm"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev)) | |
| case strings.Contains(charDev, "rdma_cm"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev)) | |
| } | |
| switch { | |
| case strings.HasPrefix(filepath.Base(charDev), "uverbs"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev)) | |
| case strings.HasPrefix(filepath.Base(charDev), "umad"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev)) | |
| case strings.HasPrefix(filepath.Base(charDev), "issm"): | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev)) | |
| case filepath.Base(charDev) == "rdma_cm": | |
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev)) | |
| } |
e857b30 to
bf6ae3e
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces RDMA device support, integrating with CDI for device injection into containers. The implementation is well-designed, with a new handleRDMADevice function in pkg/devicestate/state.go that cleanly encapsulates the logic for handling RDMA-capable devices. The host package is appropriately extended to discover RDMA character devices, maintaining a good separation of concerns. The changes are supported by a comprehensive suite of unit tests that cover various scenarios, including error conditions. Additionally, a minor improvement is suggested for mock accuracy in pkg/devicestate/state_test.go. Overall, this is a solid contribution that adds significant new functionality.
| mockHost.EXPECT().GetRDMADevicesForPCI(pciAddress).Return([]string{rdmaDeviceName}) | ||
|
|
||
| // Mock GetRDMACharDevices to return empty list | ||
| mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return([]string{}, nil) |
There was a problem hiding this comment.
The mock for GetRDMACharDevices returns an empty slice ([]string{}), but the actual implementation in pkg/host/host.go returns nil when no character devices are found. While the current test passes because len() is 0 for both an empty slice and a nil slice, it's best practice for mocks to accurately reflect the behavior of the functions they are replacing to prevent future breakages if the calling code changes to differentiate between them.
| mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return([]string{}, nil) | |
| mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return(nil, nil) |
Implements RDMA capability detection and device exposure using the Mellanox rdmamap library. Adds discovery of RDMA devices (mlx5_*), protocol detection (RoCE/InfiniBand/iWARP), and automatic injection of character devices (/dev/infiniband/*) into containers via CDI. Key changes: - Add RdmaProvider interface with rdmamap integration - Expose RDMA character devices (uverbs, umad, issm, rdma_cm) - Set environment variables for device paths and names - Add comprehensive unit tests for RDMA functionality Matches feature parity with sriov-network-device-plugin for RDMA. Tested on Mellanox ConnectX hardware with RoCE. Signed-off-by: Fred Rolland <frolland@nvidia.com>
bf6ae3e to
988909f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces RDMA device support, including discovery and injection into containers via CDI. While no specific vulnerabilities or critical issues were found, there is a bug in how environment variables are created for RDMA character devices. This can lead to overwritten environment variables if multiple devices of the same type exist. It is suggested to aggregate these device paths into a comma-separated list, aligning with sriov-network-device-plugin behavior to ensure all device paths are correctly exposed.
| // Add each character device to the CDI spec | ||
| for _, charDev := range charDevices { | ||
| deviceNodes = append(deviceNodes, &cdispec.DeviceNode{ | ||
| Path: charDev, | ||
| HostPath: charDev, | ||
| Type: "c", // character device | ||
| }) | ||
|
|
||
| // Add environment variable for each character device type | ||
| // Include RDMA device name to avoid collisions with multiple RDMA devices | ||
| switch { | ||
| case strings.HasPrefix(filepath.Base(charDev), "uverbs"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev)) | ||
| case strings.HasPrefix(filepath.Base(charDev), "umad"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev)) | ||
| case strings.HasPrefix(filepath.Base(charDev), "issm"): | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev)) | ||
| case filepath.Base(charDev) == "rdma_cm": | ||
| envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev)) | ||
| } | ||
| } |
There was a problem hiding this comment.
The current implementation for creating environment variables for RDMA character devices has a bug. If there are multiple character devices of the same type (e.g., /dev/infiniband/uverbs0 and /dev/infiniband/uverbs1), this loop will generate multiple environment variables with the same key (e.g., SRIOVNETWORK_<PREFIX>_RDMA_UVERB). This will lead to unpredictable behavior as only one of these values will likely be available in the container's environment.
To ensure correctness and align with the feature parity goal with sriov-network-device-plugin, these paths should be aggregated into a single, comma-separated environment variable. Also, the variable names should be plural (e.g., UVERBS) for consistency.
Here is a suggested implementation:
// Add each character device to the CDI spec
uverbsPaths := []string{}
umadPaths := []string{}
issmPaths := []string{}
rdmaCmPaths := []string{}
for _, charDev := range charDevices {
deviceNodes = append(deviceNodes, &cdispec.DeviceNode{
Path: charDev,
HostPath: charDev,
Type: "c", // character device
})
// Group paths by type to create comma-separated env vars
baseName := filepath.Base(charDev)
switch {
case strings.HasPrefix(baseName, "uverbs"):
uverbsPaths = append(uverbsPaths, charDev)
case strings.HasPrefix(baseName, "umad"):
umadPaths = append(umadPaths, charDev)
case strings.HasPrefix(baseName, "issm"):
issmPaths = append(issmPaths, charDev)
case baseName == "rdma_cm":
rdmaCmPaths = append(rdmaCmPaths, charDev)
}
}
// Add environment variables for each character device type.
// This uses comma-separated lists to support multiple character devices of the same type,
// matching feature parity with sriov-network-device-plugin.
if len(uverbsPaths) > 0 {
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERBS=%s", devicePrefix, strings.Join(uverbsPaths, ",")))
}
if len(umadPaths) > 0 {
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMADS=%s", devicePrefix, strings.Join(umadPaths, ",")))
}
if len(issmPaths) > 0 {
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSMS=%s", devicePrefix, strings.Join(issmPaths, ",")))
}
if len(rdmaCmPaths) > 0 {
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, strings.Join(rdmaCmPaths, ",")))
}
Implements RDMA capability detection and device exposure using the
Mellanox rdmamap library. Adds discovery of RDMA devices (mlx5_),
protocol detection (RoCE/InfiniBand/iWARP), and automatic injection
of character devices (/dev/infiniband/) into containers via CDI.
Key changes:
Matches feature parity with sriov-network-device-plugin for RDMA.
Tested on Mellanox ConnectX hardware with RoCE.
Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.