Skip to content

Comments

feat: Add RDMA device support with CDI integration#38

Open
rollandf wants to merge 1 commit intok8snetworkplumbingwg:mainfrom
rollandf:rdma-attach
Open

feat: Add RDMA device support with CDI integration#38
rollandf wants to merge 1 commit intok8snetworkplumbingwg:mainfrom
rollandf:rdma-attach

Conversation

@rollandf
Copy link
Collaborator

@rollandf rollandf commented Nov 20, 2025

Implements RDMA capability detection and device exposure using the
Mellanox rdmamap library. Adds discovery of RDMA devices (mlx5_),
protocol detection (RoCE/InfiniBand/iWARP), and automatic injection
of character devices (/dev/infiniband/
) into containers via CDI.

Key changes:

  • Add RdmaProvider interface with rdmamap integration
  • Expose RDMA character devices (uverbs, umad, issm, rdma_cm)
  • Set environment variables for device paths and names
  • Add comprehensive unit tests for RDMA functionality

Matches feature parity with sriov-network-device-plugin for RDMA.

Tested on Mellanox ConnectX hardware with RoCE.

Summary by CodeRabbit

  • New Features

    • Per-device RDMA capability detection exposed; RDMA character devices are attached and per-device RDMA environment variables added. NUMA node info is now reported for devices. Host API extended to allow injectable RDMA provider and RDMA lookups.
  • Tests

    • Expanded RDMA-focused tests covering detection, capability verification, multi-device handling, and error scenarios.
  • Chores

    • Updated dependency declarations to include RDMA-related libraries.

✏️ Tip: You can customize this high-level summary in your review settings.

@rollandf
Copy link
Collaborator Author

depends on #32

@coveralls
Copy link
Collaborator

coveralls commented Nov 20, 2025

Pull Request Test Coverage Report for Build 22016933456

Details

  • 81 of 112 (72.32%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.0%) to 44.897%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/host/rdma_provider.go 0 3 0.0%
pkg/devicestate/state.go 64 71 90.14%
pkg/host/mock/mock_rdma_provider.go 0 10 0.0%
pkg/host/mock/mock_host.go 0 11 0.0%
Totals Coverage Status
Change from base Build 21945906041: 1.0%
Covered Lines: 1443
Relevant Lines: 3214

💛 - Coveralls

@coderabbitai
Copy link

coderabbitai bot commented Dec 21, 2025

Walkthrough

Adds RDMA detection and handling: discovery marks per‑VF RDMA capability and NUMA; host gains an RdmaProvider abstraction and RDMA APIs; state injects RDMA char devices and RDMA-related env vars into CDI specs; tests and mocks updated to cover RDMA flows.

Changes

Cohort / File(s) Summary
Dependency Management
go.mod
Added github.com/Mellanox/rdmamap v1.1.0; bumped indirect network-related deps.
Constants
pkg/consts/consts.go
Added AttributeRDMACapable and SysClassInfiniband.
RDMA Provider Abstraction
pkg/host/rdma_provider.go
New RdmaProvider interface, defaultRdmaProvider delegating to rdmamap, newRdmaProvider() and go:generate for mocks.
Host Interface & Implementation
pkg/host/host.go
Added rdmaProvider field, SetRdmaProvider(), and methods GetRDMADeviceForPCI, VerifyRDMACapability, GetRDMACharDevices delegating to provider.
Host Tests
pkg/host/host_test.go
Added gomock-based tests for the new Host RDMA APIs and injection of MockRdmaProvider.
Host Mocks
pkg/host/mock/mock_host.go, pkg/host/mock/mock_rdma_provider.go
Added mocked Host RDMA methods and generated MockRdmaProvider.
Device Discovery
pkg/devicestate/discovery.go
Per‑VF NUMA parsing (int pointer) and per‑VF RDMACapable checks via VerifyRDMACapability; attributes map updated to include new fields and logging.
Discovery Tests
pkg/devicestate/discovery_test.go
Updated/added tests expecting RDMA capability checks and related attribute behavior (including failure paths).
Device State Configuration
pkg/devicestate/state.go
When RDMA-capable: new helper enumerates RDMA devices and char devices, appends CDI char device nodes, and injects RDMA env vars per RDMA device.
State Tests
pkg/devicestate/state_test.go
Added tests for RDMA preparation (success and error cases), introduced gomock usage and pointer assertions.

Sequence Diagram(s)

sequenceDiagram
    participant Discovery
    participant Host
    participant RdmaProvider
    participant rdmamap
    participant State
    participant CDI

    Discovery->>Host: VerifyRDMACapability(pciAddr)
    Host->>RdmaProvider: GetRdmaDevicesForPcidev(pciAddr)
    RdmaProvider->>rdmamap: GetRdmaDevicesForPcidev(pciAddr)
    rdmamap-->>RdmaProvider: [rdmaDevices]
    RdmaProvider-->>Host: [rdmaDevices]
    Host-->>Discovery: bool (has RDMA devices)
    Discovery->>Discovery: set RDMACapable, NumaNode in VF attributes

    State->>Host: GetRDMADeviceForPCI(pciAddr)
    Host->>RdmaProvider: GetRdmaDevicesForPcidev(pciAddr)
    RdmaProvider->>rdmamap: GetRdmaDevicesForPcidev(pciAddr)
    rdmamap-->>RdmaProvider: [rdmaDevices]
    RdmaProvider-->>Host: [rdmaDevices]
    Host-->>State: [rdmaDevices]

    loop for each rdmaDevice
      State->>Host: GetRDMACharDevices(rdmaDeviceName)
      Host->>RdmaProvider: GetRdmaCharDevices(rdmaDeviceName)
      RdmaProvider->>rdmamap: GetRdmaCharDevices(rdmaDeviceName)
      rdmamap-->>RdmaProvider: [charDevices]
      RdmaProvider-->>Host: [charDevices]
      Host-->>State: [charDevices]
      State->>CDI: Append CDI char device node(s)
      State->>State: Add env vars (uverbs, umad, issm, rdma_cm, device name)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Add RDMA device support with CDI integration' directly and clearly describes the main objective of the changeset, which is to implement RDMA device support with CDI integration across multiple packages.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (10)
pkg/devicestate/state_test.go (3)

90-118: Tests only verify mock behavior, not production code.

This test sets up mock expectations and then calls the mock methods directly, verifying that mocks return expected values. It doesn't actually test the production applyConfigOnDevice function that would use these mocks to prepare RDMA devices.

Consider creating an integration test that:

  1. Sets up a Manager instance with the mock host
  2. Calls the actual applyConfigOnDevice or PrepareDevicesForClaim method
  3. Verifies the resulting CDI spec contains the expected device nodes
🔎 Example integration test approach
It("should integrate RDMA device preparation into CDI spec", func() {
    // Setup Manager with mocks
    manager := &Manager{
        allocatable: map[string]resourceapi.Device{
            "0000-08-00-5": *deviceInfo,
        },
        // ... other required fields
    }
    
    // Mock the host.GetHelpers() to return mockHost
    // Then call actual production code and verify CDI spec output
})

120-135: Environment variable format test is documentation-only.

This test documents the expected environment variable naming convention but doesn't verify that the production code actually generates these variables. The assertions only check the test's own expectedEnvVars map.


228-243: Character device type identification test is basic.

This test only verifies that path strings contain expected substrings. The actual device type parsing logic in production code (the switch statement in state.go) is not exercised here.

pkg/devicestate/state.go (3)

217-223: Silent failure on GetRDMADeviceForPCI error may mask issues.

When GetRDMADeviceForPCI fails, the error is logged but the device is still prepared without RDMA support. While this allows graceful degradation, users may not notice that RDMA was silently disabled.

Consider adding a warning-level log or an event to make this more visible, especially since the device was discovered as RDMA-capable during initial discovery.


243-248: Potential duplicate CDI device nodes for shared devices like rdma_cm.

/dev/infiniband/rdma_cm is typically a system-wide device, not per-RDMA-adapter. If GetRDMACharDevices returns it for multiple RDMA devices, it will be added multiple times to deviceNodes.

Consider deduplicating device nodes before adding them to the CDI spec:

🔎 Suggested deduplication approach
// Track added device paths to avoid duplicates
addedDevices := make(map[string]bool)

for _, charDev := range charDevices {
    if addedDevices[charDev] {
        continue // Skip duplicate
    }
    addedDevices[charDev] = true
    
    deviceNodes = append(deviceNodes, &cdispec.DeviceNode{
        Path:     charDev,
        HostPath: charDev,
        Type:     "c",
    })
    // ... rest of the logic
}

252-262: Consider extracting devicePrefix outside the inner loop.

The devicePrefix is recomputed for every character device but only depends on result.Device, which doesn't change within the loops. Moving it outside improves readability and avoids redundant computation.

+		devicePrefix := strings.ReplaceAll(result.Device, "-", "_")
 		for _, rdmaDevice := range rdmaDevices {
 			// ...
 			for _, charDev := range charDevices {
 				// ...
-				devicePrefix := strings.ReplaceAll(result.Device, "-", "_")
 				switch {
pkg/devicestate/discovery.go (1)

145-151: Consider logging at a lower verbosity level for non-critical errors.

The RDMA capability check correctly falls back to false on error. However, using logger.Error for this may be too noisy in environments without RDMA hardware. Consider using logger.V(1).Info or logger.V(2).Info instead, since this is an expected condition for non-RDMA devices.

🔎 Suggested change
 			// Check RDMA capability for this VF
 			rdmaCapable, err := host.GetHelpers().VerifyRDMACapability(vfInfo.PciAddress)
 			if err != nil {
-				logger.Error(err, "Failed to verify RDMA capability, assuming not RDMA capable",
+				logger.V(2).Info("Failed to verify RDMA capability, assuming not RDMA capable",
+					"error", err,
 					"vfAddress", vfInfo.PciAddress)
 				rdmaCapable = false
 			}
pkg/host/host_test.go (1)

582-587: Consider adding type assertion safety check.

The type assertion host.NewHost().(*host.Host) will panic if NewHost() ever returns a different implementation. While this is unlikely in the current codebase, adding a safety check improves test robustness.

🔎 Suggested change
 		BeforeEach(func() {
 			mockCtrl = gomock.NewController(GinkgoT())
 			mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl)
-			hostImpl = host.NewHost().(*host.Host)
+			var ok bool
+			hostImpl, ok = host.NewHost().(*host.Host)
+			Expect(ok).To(BeTrue(), "NewHost() should return *host.Host")
 			hostImpl.SetRdmaProvider(mockRdmaProvider)
 		})
pkg/host/host.go (2)

153-157: Consider adding nil guard for defensive coding.

While this method is primarily for testing, a nil check would prevent potential nil pointer dereferences if called incorrectly.

🔎 Proposed defensive nil check
 func (h *Host) SetRdmaProvider(provider RdmaProvider) {
+	if provider == nil {
+		return
+	}
 	h.rdmaProvider = provider
 }

786-800: Error return value is never non-nil.

The method signature includes an error return, but the implementation never returns an error since rdmaProvider.GetRdmaDevicesForPcidev only returns []string. This may be intentional for future extensibility, but consider documenting this in the method's godoc or updating the interface if errors should be propagated.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27679f3 and 8444628.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • go.mod (2 hunks)
  • pkg/consts/consts.go (2 hunks)
  • pkg/devicestate/discovery.go (1 hunks)
  • pkg/devicestate/discovery_test.go (7 hunks)
  • pkg/devicestate/state.go (1 hunks)
  • pkg/devicestate/state_test.go (2 hunks)
  • pkg/host/host.go (3 hunks)
  • pkg/host/host_test.go (2 hunks)
  • pkg/host/mock/mock_host.go (2 hunks)
  • pkg/host/mock/mock_rdma_provider.go (1 hunks)
  • pkg/host/rdma_provider.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/rdma_provider.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/host_test.go
  • pkg/host/mock/mock_host.go
  • pkg/host/host.go
📚 Learning: 2025-12-10T11:30:12.071Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:12.071Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type returns numeric format like "1: CA" not plain text "CA". The numeric prefixes are: 1 = CA (Channel Adapter - InfiniBand or RoCE), 4 = RNIC (iWARP). Code should use strings.HasPrefix() to parse these values.

Applied to files:

  • pkg/consts/consts.go
🧬 Code graph analysis (5)
pkg/devicestate/discovery.go (2)
pkg/host/host.go (1)
  • GetHelpers (148-151)
pkg/consts/consts.go (12)
  • AttributeVendorID (35-35)
  • AttributeDeviceID (36-36)
  • AttributePFDeviceID (37-37)
  • AttributePciAddress (32-32)
  • AttributePFName (33-33)
  • AttributeEswitchMode (34-34)
  • AttributeVFID (38-38)
  • AttributeNumaNode (42-42)
  • AttributePCIeRoot (60-60)
  • AttributeParentPciAddress (47-47)
  • AttributeStandardPciAddress (44-44)
  • AttributeRDMACapable (40-40)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
  • MockInterface (23-27)
  • MockInterfaceMockRecorder (30-32)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
  • AttributeRDMACapable (40-40)
pkg/host/host.go (1)
  • GetHelpers (148-151)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint
  • GitHub Check: Build Container
  • GitHub Check: Unit Tests
🔇 Additional comments (23)
go.mod (2)

83-84: Transitive dependencies look appropriate.

The vishvananda/netlink and vishvananda/netns are standard Go libraries for Linux network namespace and netlink operations, commonly used alongside RDMA libraries.


8-8: LGTM - rdmamap v1.1.0 for RDMA device discovery and mapping.

The github.com/Mellanox/rdmamap library provides a map of RDMA devices with their character and network devices, making it suitable for RDMA device support in this DRA driver context. rdmamap v1.1.0 is stable and actively used in production K8s plugins.

pkg/consts/consts.go (2)

40-40: LGTM - RDMA capability attribute follows existing conventions.

The AttributeRDMACapable constant follows the established naming pattern and will be used to mark RDMA-capable VFs during discovery.


52-54: LGTM - Correct sysfs path for InfiniBand devices.

/sys/class/infiniband is the standard Linux sysfs path for discovering RDMA/InfiniBand devices and their properties.

pkg/devicestate/state_test.go (2)

51-88: Test setup looks correct, but consider testing actual integration.

The mock setup and lifecycle management are properly implemented. However, the deviceInfo variable is created but not used to invoke any production code that would exercise the RDMA preparation logic.


170-193: Good coverage for non-RDMA device skip behavior.

This test correctly verifies that GetRDMACharDevices is never called when a device is not RDMA-capable, which validates the guard condition in production code.

pkg/devicestate/discovery_test.go (3)

81-82: LGTM - Existing tests updated for RDMA capability verification.

The existing test cases are properly updated to include VerifyRDMACapability mock expectations, ensuring the discovery flow remains testable after the RDMA changes.


255-321: Comprehensive test for RDMA-capable VF discovery.

This test covers:

  • Mixed RDMA capability (VF1 capable, VF2 not capable)
  • Correct attribute propagation for RDMA-capable devices
  • Verification of all standard and RDMA-specific attributes

The assertions at lines 314 and 320 properly verify the AttributeRDMACapable boolean attribute for both cases.


323-357: Good error handling test for RDMA capability check failures.

This test verifies that when VerifyRDMACapability returns an error, the discovery process:

  1. Does not fail entirely
  2. Defaults the device to non-RDMA capable (false)

This graceful degradation ensures device discovery continues even if RDMA capability detection fails for individual devices.

pkg/devicestate/state.go (1)

217-273: Overall RDMA integration logic is sound.

The implementation correctly:

  • Guards RDMA processing with capability check
  • Handles errors gracefully without failing device preparation
  • Adds appropriate CDI device nodes for RDMA character devices
  • Sets environment variables following sriov-network-device-plugin conventions

The structure integrates well with the existing VFIO and vhost device handling patterns.

pkg/devicestate/discovery.go (2)

135-139: LGTM! Safe NUMA node parsing with reasonable default.

The parsing logic correctly defaults to 0 when parsing fails, and the use of ptr.To is consistent with the rest of the codebase.


162-208: Well-structured attribute map with proper typing.

The refactored attributes map is clean and readable. All attribute constants are correctly referenced from consts package, and the pointer types are consistently used via ptr.To. The new RDMACapable boolean attribute is properly integrated.

pkg/host/host_test.go (3)

593-627: Good test coverage for GetRDMADeviceForPCI.

The tests properly cover:

  • Multiple RDMA devices returned
  • Empty device list (returns nil)
  • Non-RDMA-capable devices

Mock expectations are correctly configured with exact call counts.


629-662: Good test coverage for VerifyRDMACapability.

Tests appropriately verify that capability returns true when devices exist and false when the device list is empty.


664-744: Comprehensive test coverage for GetRDMACharDevices.

The tests cover multiple scenarios including:

  • Full set of character devices
  • Subset of devices
  • Empty device list
  • Multiple RDMA devices with different character devices

This provides good confidence in the RDMA character device discovery functionality.

pkg/host/mock/mock_host.go (1)

1-10: Auto-generated mock file - no manual changes needed.

This file is generated by mockgen and follows standard gomock patterns. Ensure the go:generate directive in the source file is run when interface methods are added or changed.

pkg/host/rdma_provider.go (2)

23-30: Clean abstraction design for testability.

The RdmaProvider interface provides a thin wrapper over the rdmamap library, enabling easy mocking in unit tests. The go:generate directive is correctly placed to auto-generate mocks.


32-47: Well-implemented provider with appropriate receiver type.

The implementation correctly delegates to rdmamap's GetRdmaDevicesForPcidev and GetRdmaCharDevices functions, which both return []string. Using a value receiver on the empty defaultRdmaProvider struct is appropriate since there's no internal state.

pkg/host/mock/mock_rdma_provider.go (1)

1-10: Auto-generated mock file - no manual changes needed.

This file is correctly generated from rdma_provider.go and follows standard gomock conventions. The mock is properly used in host_test.go for testing RDMA functionality.

pkg/host/host.go (4)

113-118: LGTM!

The RDMA interface methods are well-designed with appropriate return types and consistent naming following the existing interface patterns.


121-124: LGTM!

Good use of dependency injection pattern by adding rdmaProvider as a field, enabling testability through mock injection.


127-132: LGTM!

Constructor properly initializes the rdmaProvider with the default implementation.


802-835: LGTM!

VerifyRDMACapability and GetRDMACharDevices implementations are clean and consistent with the file's patterns. Error handling in VerifyRDMACapability is properly structured to wrap any potential future errors from GetRDMADeviceForPCI.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
pkg/devicestate/state_test.go (3)

90-118: Tests verify mock behavior rather than production code.

This test sets up mock expectations and then directly calls the mocked methods (mockHost.GetRDMADeviceForPCI, mockHost.GetRDMACharDevices). It doesn't actually test any production code in state.go that would use these mocks.

A proper integration test should:

  1. Create a Manager instance with the mocked host
  2. Call a real Manager method that internally invokes the RDMA host functions
  3. Verify the end result (e.g., CDI spec contains expected device nodes)

Currently this only confirms that gomock returns what you told it to return.


120-136: Test verifies hardcoded expectations without exercising production code.

This test constructs an expectedEnvVars map and asserts that it has 5 entries and contains specific keys. No production code that generates environment variables is invoked. The test is essentially validating a map literal against itself.

Consider invoking the actual environment variable generation logic from state.go and verifying the output matches expectations.


229-244: Test verifies ContainSubstring rather than production device-type logic.

This test only checks that paths contain expected substrings using Gomega matchers. If there's production code that parses device paths to identify types (e.g., extracting "uverbs" from "/dev/infiniband/uverbs0"), that function should be tested directly. Otherwise, this test adds no value.

pkg/host/host.go (2)

786-800: Consider using V(2) for the success log to reduce noise.

Line 797 logs at Info level for every successful RDMA device lookup. On systems with many RDMA-capable devices, this could be verbose. The "no devices found" case uses V(2) which is appropriate; the success case should likely match.

🔎 Suggested fix
-	h.log.Info("GetRDMADeviceForPCI(): found RDMA devices",
+	h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices",
 		"pciAddress", pciAddr, "rdmaDevices", rdmaDevices)

818-835: Same logging level consideration as GetRDMADeviceForPCI.

Line 832 also logs at Info level for successful lookups. For consistency and to reduce log noise, consider using V(2) here as well.

🔎 Suggested fix
-	h.log.Info("GetRDMACharDevices(): found character devices",
+	h.log.V(2).Info("GetRDMACharDevices(): found character devices",
 		"rdmaDevice", rdmaDeviceName, "charDevices", charDevices)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8444628 and a007b52.

📒 Files selected for processing (7)
  • pkg/devicestate/state.go (1 hunks)
  • pkg/devicestate/state_test.go (2 hunks)
  • pkg/host/host.go (3 hunks)
  • pkg/host/host_test.go (2 hunks)
  • pkg/host/mock/mock_host.go (2 hunks)
  • pkg/host/mock/mock_rdma_provider.go (1 hunks)
  • pkg/host/rdma_provider.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/host/mock/mock_host.go
  • pkg/devicestate/state.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/host_test.go
  • pkg/host/host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/rdma_provider.go
🧬 Code graph analysis (1)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint
  • GitHub Check: Build Container
🔇 Additional comments (11)
pkg/devicestate/state_test.go (1)

138-169: Same pattern: tests mock returns rather than production integration.

Similar to the previous tests, this sets up expectations and directly calls the mock. Consider consolidating these into integration tests that exercise Manager methods.

pkg/host/host_test.go (4)

582-591: LGTM! Proper test setup with mock injection.

The test correctly initializes the mock controller, creates a mock RDMA provider, and injects it into the Host instance via SetRdmaProvider. The AfterEach properly calls Finish() to verify expectations.


593-627: LGTM! Good coverage of GetRDMADeviceForPCI scenarios.

Tests cover:

  • Multiple RDMA devices returned
  • Empty list converted to nil
  • Non-RDMA-capable device returning nil

These align with the production code logic in host.go.


629-662: LGTM! VerifyRDMACapability tests appropriately verify boolean logic.


664-744: LGTM! Comprehensive GetRDMACharDevices test coverage.

Tests cover full device lists, partial lists, empty returns (nil conversion), and differentiation across multiple RDMA devices. Well-structured test cases.

pkg/host/mock/mock_rdma_provider.go (1)

1-68: LGTM! Auto-generated mock implementation.

This is standard mockgen-generated code matching the RdmaProvider interface. No manual review needed for generated code.

pkg/host/host.go (4)

113-118: LGTM! Clean interface extension for RDMA operations.

The new RDMA methods follow the existing interface patterns and provide a consistent API for RDMA device discovery and capability verification.


120-132: LGTM! Host struct extended with RDMA provider dependency.

The rdmaProvider field and its initialization in NewHost() follow dependency injection patterns, enabling testability.


153-157: LGTM! Test injection method for RDMA provider.

Clear documentation that this is primarily for unit testing. An alternative would be functional options in NewHost(), but the current approach is pragmatic and sufficient.


802-816: LGTM! VerifyRDMACapability correctly wraps device lookup.

Simple boolean wrapper that properly propagates errors and checks device presence.

pkg/host/rdma_provider.go (1)

27-47: LGTM! Clean abstraction layer for RDMA device discovery.

The RdmaProvider interface provides a thin wrapper around the rdmamap library, enabling testability through mock injection. The interface methods correctly mirror the underlying library's API, which returns empty slices on failure rather than error values.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (10)
pkg/devicestate/state_test.go (3)

90-118: Tests exercise mocks directly rather than production code.

The test sets up mock expectations and then calls mockHost.GetRDMADeviceForPCI() and mockHost.GetRDMACharDevices() directly. This verifies the mock works but doesn't test any production logic. Consider testing the actual applyConfigOnDevice function or similar production code path that uses these host methods.


120-136: Test validates manually constructed expectations without exercising production code.

This test constructs the expected environment variables and then asserts they exist in the map that was just created. To provide value, consider calling the production code that generates these environment variables and validating the output matches expectations.


236-251: Test validates self-fulfilling assertions.

The test verifies that /dev/infiniband/uverbs0 contains "uverbs", which is trivially true. To add value, consider testing the production code's switch statement logic from state.go lines 256-265 that identifies device types from paths.

pkg/devicestate/state.go (3)

256-265: Consider more precise device type matching.

Using strings.Contains could match incorrectly in edge cases. For example, a hypothetical path containing both substrings. Consider using filepath.Base and prefix matching for more robust identification:

🔎 Suggested improvement
+				baseName := filepath.Base(charDev)
 				switch {
-				case strings.Contains(charDev, "uverbs"):
+				case strings.HasPrefix(baseName, "uverbs"):
 					envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UVERBS=%s", devicePrefix, rdmaDeviceSanitized, charDev))
-				case strings.Contains(charDev, "umad"):
+				case strings.HasPrefix(baseName, "umad"):
 					envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UMAD=%s", devicePrefix, rdmaDeviceSanitized, charDev))
-				case strings.Contains(charDev, "issm"):
+				case strings.HasPrefix(baseName, "issm"):
 					envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_ISSM=%s", devicePrefix, rdmaDeviceSanitized, charDev))
-				case strings.Contains(charDev, "rdma_cm"):
+				case baseName == "rdma_cm":
 					envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_CM=%s", devicePrefix, rdmaDeviceSanitized, charDev))
 				}

268-269: Consider consistent log levels.

Line 268 uses logger.Info while similar messages (lines 224-225, 237-239) use logger.V(2).Info. For consistency and to avoid verbose production logs, consider using V(2) here as well.

🔎 Suggested change
-			logger.Info("Added RDMA character devices for device",
+			logger.V(2).Info("Added RDMA character devices for device",
 				"device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices)

246-266: Deduplicate shared RDMA character devices in CDI spec.

The code appends character devices without deduplication. When multiple RDMA devices (e.g., mlx5_5 and mlx5_6) are present, shared devices like /dev/infiniband/rdma_cm will be added multiple times. Consider deduplicating deviceNodes by path before appending, using a set or checking for existing entries to prevent redundant device node entries.

pkg/host/host_test.go (2)

582-587: Type assertion without safety check.

The type assertion at line 585 will panic if NewHost() returns a type other than *host.Host. While this is unlikely in practice, consider using a safe assertion or verifying the test setup:

🔎 Suggested improvement
 		BeforeEach(func() {
 			mockCtrl = gomock.NewController(GinkgoT())
 			mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl)
-			hostImpl = host.NewHost().(*host.Host)
+			var ok bool
+			hostImpl, ok = host.NewHost().(*host.Host)
+			Expect(ok).To(BeTrue(), "NewHost() should return *Host")
 			hostImpl.SetRdmaProvider(mockRdmaProvider)
 		})

606-626: Duplicate test cases.

The tests "should return nil when no RDMA devices exist" and "should return nil when device is not RDMA capable" have identical mock setups and assertions. From the production code's perspective, both scenarios result in an empty list from the provider. Consider consolidating into a single test or differentiating the scenarios if there's intended behavioral difference.

pkg/host/host.go (2)

797-799: Consider V(2) log level for routine RDMA device discovery.

The Info level log at line 797 will output for every RDMA-capable device during normal operation. Consider using V(2).Info for consistency with the "not found" case at line 793.

🔎 Suggested change
-	h.log.Info("GetRDMADeviceForPCI(): found RDMA devices",
+	h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices",
 		"pciAddress", pciAddr, "rdmaDevices", rdmaDevices)

832-834: Same log level consideration as GetRDMADeviceForPCI.

For consistency, consider using V(2).Info here as well.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a007b52 and 354658e.

📒 Files selected for processing (7)
  • pkg/devicestate/state.go (1 hunks)
  • pkg/devicestate/state_test.go (2 hunks)
  • pkg/host/host.go (3 hunks)
  • pkg/host/host_test.go (2 hunks)
  • pkg/host/mock/mock_host.go (2 hunks)
  • pkg/host/mock/mock_rdma_provider.go (1 hunks)
  • pkg/host/rdma_provider.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/host/mock/mock_host.go
  • pkg/host/mock/mock_rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/host_test.go
  • pkg/host/host.go
  • pkg/host/rdma_provider.go
🧬 Code graph analysis (3)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
  • AttributeRDMACapable (40-40)
pkg/host/host.go (1)
  • GetHelpers (148-151)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Container
  • GitHub Check: Lint
🔇 Additional comments (8)
pkg/devicestate/state_test.go (1)

171-201: Improved test for non-RDMA device handling.

The test now directly validates the conditional logic that determines whether RDMA preparation should occur, which is an improvement over the previous version. The condition at line 194 correctly mirrors the production code condition.

pkg/devicestate/state.go (1)

217-276: RDMA integration looks well-structured.

The RDMA handling block correctly:

  • Checks for RDMA capability before proceeding
  • Handles errors gracefully without failing device preparation
  • Includes RDMA device name in env var keys to prevent collisions with multiple RDMA devices
  • Adds character devices to CDI spec for container access
pkg/host/host_test.go (1)

664-744: Comprehensive test coverage for GetRDMACharDevices.

Good coverage of edge cases including:

  • Full device list return
  • Partial device list return
  • Empty/nil returns
  • Multiple RDMA devices with different character devices
pkg/host/rdma_provider.go (2)

27-42: Clean abstraction enabling testability.

The RdmaProvider interface properly wraps the external rdmamap library, enabling mock injection for unit tests. The stateless defaultRdmaProvider struct appropriately uses value receivers.


19-21: Consider light maintenance status of github.com/Mellanox/rdmamap dependency.

The implementation depends on github.com/Mellanox/rdmamap (v1.1.0), which is stable but has minimal ongoing maintenance (last release November 23, 2022). While the library is production-suitable, be aware that this lightly-maintained dependency may have limited community support for future issues or compatibility updates.

pkg/host/host.go (3)

113-118: Clean interface extension for RDMA functionality.

The new RDMA methods follow the existing interface patterns and provide a clear API for RDMA device discovery and capability verification.


153-157: Good design for test injection.

Exposing SetRdmaProvider on the concrete *Host type rather than the Interface keeps the main API clean while enabling mock injection in tests.


802-816: LGTM.

VerifyRDMACapability correctly delegates to GetRDMADeviceForPCI and properly handles potential future error cases.

@rollandf
Copy link
Collaborator Author

rollandf commented Dec 21, 2025

Get pod names

POD1=$(kubectl get pod -l app=sriov-rdma-test -o jsonpath='{.items[0].metadata.name}')
POD2=$(kubectl get pod -l app=sriov-rdma-test -o jsonpath='{.items[1].metadata.name}')

echo "Pod 1: $POD1"
echo "Pod 2: $POD2"

Pod 1: sriov-rdma-test-4hdhd
Pod 2: sriov-rdma-test-gnbsr

Check RDMA environment variables

kubectl exec $POD1 -- env | grep RDMA

SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_UMAD=/dev/infiniband/umad3
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_UVERBS=/dev/infiniband/uverbs3
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_CM=/dev/infiniband/rdma_cm
SRIOVNETWORK_0000_08_00_3_mlx53_RDMA_DEVICE=mlx5_3

Check RDMA devices

kubectl exec $POD1 -- ls -la /dev/infiniband/
kubectl exec $POD2 -- ls -la /dev/infiniband/
total 0
drwxr-xr-x 2 root root      100 Dec 21 13:42 .
drwxr-xr-x 6 root root      380 Dec 21 13:42 ..
crw-rw-rw- 1 root root  10, 122 Dec 21 13:42 rdma_cm
crw-rw-rw- 1 root root 231,   3 Dec 21 13:42 umad3
crw-rw-rw- 1 root root 231, 195 Dec 21 13:42 uverbs3
total 0
drwxr-xr-x 2 root root      100 Dec 21 13:42 .
drwxr-xr-x 6 root root      380 Dec 21 13:42 ..
crw-rw-rw- 1 root root  10, 122 Dec 21 13:42 rdma_cm
crw-rw-rw- 1 root root 231,   3 Dec 21 13:42 umad3
crw-rw-rw- 1 root root 231, 195 Dec 21 13:42 uverbs3

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/host/host.go (1)

784-835: Consider adding input validation for RDMA methods.

The RDMA device methods work correctly but could benefit from basic input validation:

  • GetRDMADeviceForPCI and VerifyRDMACapability: validate PCI address format (e.g., check for domain:bus:device.function pattern)
  • GetRDMACharDevices: validate rdmaDeviceName is non-empty

This would provide earlier error detection and clearer failure messages for malformed inputs, though the current implementation will handle invalid inputs gracefully by returning empty results.

Example validation approach
 func (h *Host) GetRDMADeviceForPCI(pciAddr string) ([]string, error) {
+	if pciAddr == "" {
+		return nil, fmt.Errorf("pciAddr cannot be empty")
+	}
+	// Optional: validate PCI address format (DDDD:BB:DD.F)
+	if len(strings.Split(pciAddr, ":")) != 3 {
+		return nil, fmt.Errorf("invalid PCI address format: %s", pciAddr)
+	}
 	h.log.V(2).Info("GetRDMADeviceForPCI(): getting RDMA devices for PCI address", "device", pciAddr)
 func (h *Host) GetRDMACharDevices(rdmaDeviceName string) ([]string, error) {
+	if rdmaDeviceName == "" {
+		return nil, fmt.Errorf("rdmaDeviceName cannot be empty")
+	}
 	h.log.V(2).Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 354658e and cb91e4d.

📒 Files selected for processing (7)
  • pkg/devicestate/state.go (1 hunks)
  • pkg/devicestate/state_test.go (2 hunks)
  • pkg/host/host.go (3 hunks)
  • pkg/host/host_test.go (2 hunks)
  • pkg/host/mock/mock_host.go (2 hunks)
  • pkg/host/mock/mock_rdma_provider.go (1 hunks)
  • pkg/host/rdma_provider.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/devicestate/state_test.go
  • pkg/host/mock/mock_host.go
  • pkg/devicestate/state.go
  • pkg/host/host_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/rdma_provider.go
🧬 Code graph analysis (1)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Container
  • GitHub Check: Lint
🔇 Additional comments (5)
pkg/host/rdma_provider.go (1)

23-47: LGTM! Clean abstraction for RDMA provider.

The RdmaProvider interface and default implementation follow best practices for dependency injection and testability. The delegation to the rdmamap library is straightforward, and the factory function appropriately returns the interface type.

pkg/host/mock/mock_rdma_provider.go (1)

1-68: Generated mock implementation is correct.

This gomock-generated mock correctly implements the RdmaProvider interface with the expected methods and recorder pattern.

pkg/host/host.go (3)

114-117: Well-designed RDMA interface additions.

The three new methods provide a clear and consistent API for RDMA device discovery and verification. The naming follows the existing convention and the separation of concerns (device lookup vs. capability verification) is appropriate.


122-132: Proper initialization of RDMA provider.

The rdmaProvider field is correctly initialized in NewHost() and uses the interface type for testability. The initialization pattern is consistent with the rest of the codebase.


153-157: Good test injection mechanism.

SetRdmaProvider enables clean dependency injection for unit tests. The method is well-documented and follows a common pattern for test seams.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/devicestate/state_test.go (1)

49-81: Test validates boolean logic but doesn't exercise production code.

The test replicates the conditional expression (line 73) but never invokes the actual production function (e.g., applyConfigOnDevice). This only verifies that Go evaluates boolean expressions correctly, not that your RDMA preparation logic behaves as expected.

Consider updating the test to actually call the production code path with a non-RDMA device and verify that RDMA-related methods (e.g., GetRDMADeviceForPCI, GetRDMACharDevices) are not invoked. This would provide meaningful coverage of the skip logic.

🔎 Suggested approach

Instead of duplicating the conditional logic:

  1. Set up a mock host with expectations that RDMA methods should not be called
  2. Call applyConfigOnDevice or the relevant production function with a non-RDMA device
  3. Verify the mock expectations (that RDMA methods were not called)
  4. Assert the returned PreparedDevice does not contain RDMA device nodes or environment variables

This would test the actual behavior rather than the boolean expression.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb91e4d and d3c8cdb.

📒 Files selected for processing (7)
  • pkg/devicestate/state.go (1 hunks)
  • pkg/devicestate/state_test.go (2 hunks)
  • pkg/host/host.go (3 hunks)
  • pkg/host/host_test.go (2 hunks)
  • pkg/host/mock/mock_host.go (2 hunks)
  • pkg/host/mock/mock_rdma_provider.go (1 hunks)
  • pkg/host/rdma_provider.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/host/mock/mock_host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/host_test.go
  • pkg/host/host.go
🧬 Code graph analysis (3)
pkg/devicestate/state_test.go (1)
pkg/consts/consts.go (2)
  • AttributePciAddress (32-32)
  • AttributeRDMACapable (40-40)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
pkg/host/host.go (2)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
pkg/cni/interface.go (1)
  • Interface (30-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Container
  • GitHub Check: Lint
🔇 Additional comments (4)
pkg/devicestate/state.go (1)

217-276: RDMA integration looks solid; collision issue resolved.

The RDMA handling correctly addresses the previous environment variable collision concern by including both devicePrefix and rdmaDeviceSanitized in environment variable keys (lines 258-264, 272-273), ensuring multiple RDMA devices per PCI address won't overwrite each other's variables.

Error handling appropriately continues on failure (lines 231-234, 236-240), and the integration follows the established pattern for VFIO and vhost device handling.

pkg/host/host_test.go (1)

593-766: Comprehensive RDMA test coverage.

The test suite thoroughly covers:

  • Happy paths for RDMA device discovery and character device retrieval
  • Edge cases (empty lists, non-RDMA devices)
  • Input validation (empty/invalid PCI addresses, empty device names)
  • Multi-device scenarios to ensure different devices return different character devices

The gomock-based approach properly isolates the RDMA provider for focused unit testing.

pkg/host/host.go (2)

153-157: LGTM - Test injection pattern.

The SetRdmaProvider method appropriately enables mock injection for unit testing while documenting its primary purpose. This is a clean approach to dependency injection.


784-849: RDMA functions are well-structured with appropriate logging and error handling.

The three RDMA methods follow consistent patterns:

  • Input validation for empty/malformed parameters
  • Delegation to the rdmaProvider abstraction
  • Appropriate return of nil for "not found" vs. errors for invalid input
  • Logging at suitable verbosity levels (V(2) for debug, Info for discoveries)

The implementations correctly handle the case where RDMA devices don't exist by returning nil rather than errors, which aligns with the optional nature of RDMA capability.

Comment on lines 582 to 587
BeforeEach(func() {
mockCtrl = gomock.NewController(GinkgoT())
mockRdmaProvider = mock_host.NewMockRdmaProvider(mockCtrl)
hostImpl = host.NewHost().(*host.Host)
hostImpl.SetRdmaProvider(mockRdmaProvider)
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Type assertion could panic without check.

Line 585 performs an unchecked type assertion that would panic if NewHost() returns an unexpected implementation. While this works in practice since NewHost() returns *Host, a safer approach would validate the assertion.

🔎 Safer type assertion
-			hostImpl = host.NewHost().(*host.Host)
+			var ok bool
+			hostImpl, ok = host.NewHost().(*host.Host)
+			Expect(ok).To(BeTrue(), "NewHost() should return *host.Host")
🤖 Prompt for AI Agents
In pkg/host/host_test.go around lines 582 to 587, the unchecked type assertion
host.NewHost().(*host.Host) can panic if NewHost() returns a different
implementation; replace it with a safe comma-ok assertion (or type switch) and
fail the test if the assertion is false. Specifically, assign the result to an
interface variable, do v, ok := result.(*host.Host), use Ginkgo/Gomega (e.g.,
Expect(ok).To(BeTrue()) or Failf) to stop the test on mismatch, and only call
v.SetRdmaProvider(...) when ok is true.

@rollandf rollandf force-pushed the rdma-attach branch 2 times, most recently from 54c02d4 to 0958b26 Compare December 22, 2025 08:46
@SchSeba SchSeba mentioned this pull request Dec 31, 2025
Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

I reviewed only the second commit as the first one is implemented in the other PR

}

// If device is RDMA capable, add RDMA character devices
if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this into a function that returns the mounts?

it will be good to move the others also into functions but that can be done in a separate PR :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue {
rdmaDevices, err := host.GetHelpers().GetRDMADeviceForPCI(pciAddress)
if err != nil {
logger.Error(err, "Failed to get RDMA devices for PCI address",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return an error here and not let the pod start without the RDMA devices.

if the VF is reported as RDMA this means we manage to run the GetRDMADeviceForPCI in the discovery.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if err != nil {
logger.Error(err, "Failed to get RDMA devices for PCI address",
"device", pciAddress)
} else if len(rdmaDevices) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we move into a function we can remove the else if

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if err != nil {
logger.Error(err, "Failed to get RDMA character devices, skipping",
"device", pciAddress, "rdmaDevice", rdmaDevice)
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for a device to report RDMA capable and not having the char devices?

In general I don't want to be in a place where the user request RDMA and at the end we allow the pod to start without the needed mounts, meaning RDMA will not really work for them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// Add environment variable for each character device type
// Include RDMA device name to avoid collisions with multiple RDMA devices
switch {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can leave this one for now, but I am not really sure someone will really use the ENVs.

"device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices)

// Add RDMA device name to environment variables
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_DEVICE=%s",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a new lets of envs for RDMA and add it into the log on line (268) 0958b26#diff-c715c6df7a2299a8c29472a78a7ca23f6bc19143e636875e54a0917b0a443ba3R268

after that we can append

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

})
})

Context("RDMA Device Preparation", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test to validate the mounts and the envs exist

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pkg/host/host.go Outdated
charDevices := h.rdmaProvider.GetRdmaCharDevices(rdmaDeviceName)

if len(charDevices) == 0 {
h.log.V(2).Info("GetRDMACharDevices(): no character devices found", "rdmaDevice", rdmaDeviceName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove from debug logs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
pkg/devicestate/state_test.go (1)

54-86: Test validates attributes but doesn't exercise production code path.

This test creates a device with rdmaCapable=false and then validates the conditional logic in comments (lines 75-78), but it doesn't actually invoke the production code that uses this logic. While the test correctly verifies that the shouldPrepareRDMA condition evaluates to false, it would be more valuable to call the actual device preparation function with this non-RDMA device to verify that RDMA preparation is truly skipped in the production code path.

💡 Consider enhancing the test

Consider invoking the actual device preparation/state management function that uses this conditional logic, so the test verifies the production behavior rather than just the condition evaluation:

 		shouldPrepareRDMA := exists && rdmaCapable.BoolValue != nil && *rdmaCapable.BoolValue
 		Expect(shouldPrepareRDMA).To(BeFalse(), "RDMA preparation should be skipped for non-RDMA capable devices")
 
-		// When this condition is false, the production code never calls:
-		// - GetRDMADeviceForPCI
-		// - GetRDMACharDevices
-		// This test verifies the condition evaluates correctly for non-RDMA devices
+		// Actually invoke the production function to verify RDMA preparation is skipped
+		// (Add appropriate function call here based on your production API)
 	})
pkg/devicestate/state.go (1)

329-330: Consider using V(2).Info for per-device logging to reduce noise.

This log statement runs for every RDMA device processed. Using logger.V(2).Info would be more consistent with similar logging patterns in the codebase (e.g., line 285) and reduce log noise in production environments with many RDMA devices.

🔎 Suggested change
-			logger.Info("Added RDMA character devices for device",
+			logger.V(2).Info("Added RDMA character devices for device",
 				"device", pciAddress, "rdmaDevice", rdmaDevice, "charDevices", charDevices, "envs", envs)
pkg/host/host.go (1)

811-812: Consider using V(2).Info for RDMA discovery result logging.

The "getting" operations log at V(2) level (lines 802, 818, 841), but the result logging uses Info level. During device discovery with many RDMA devices, this could produce verbose output. Consider using V(2).Info for consistency with similar discovery patterns elsewhere in the codebase.

🔎 Suggested changes
-	h.log.Info("GetRDMADeviceForPCI(): found RDMA devices",
+	h.log.V(2).Info("GetRDMADeviceForPCI(): found RDMA devices",
 		"pciAddress", pciAddr, "rdmaDevices", rdmaDevices)
-	h.log.Info("VerifyRDMACapability(): RDMA capability check result",
+	h.log.V(2).Info("VerifyRDMACapability(): RDMA capability check result",
 		"device", pciAddr, "rdmaCapable", hasRDMA)
-	h.log.Info("GetRDMACharDevices(): found character devices",
+	h.log.V(2).Info("GetRDMACharDevices(): found character devices",
 		"rdmaDevice", rdmaDeviceName, "charDevices", charDevices)

Also applies to: 826-827, 851-852

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0958b26 and 8ed2516.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • go.mod
  • pkg/consts/consts.go
  • pkg/devicestate/discovery.go
  • pkg/devicestate/discovery_test.go
  • pkg/devicestate/state.go
  • pkg/devicestate/state_test.go
  • pkg/host/host.go
  • pkg/host/host_test.go
  • pkg/host/mock/mock_host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/rdma_provider.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/consts/consts.go
  • pkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-10T11:30:12.071Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:12.071Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type returns numeric format like "1: CA" not plain text "CA". The numeric prefixes are: 1 = CA (Channel Adapter - InfiniBand or RoCE), 4 = RNIC (iWARP). Code should use strings.HasPrefix() to parse these values.

Applied to files:

  • pkg/devicestate/discovery.go
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/mock/mock_host.go
  • pkg/host/host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/host_test.go
🧬 Code graph analysis (4)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
  • MockInterface (23-27)
  • MockInterfaceMockRecorder (30-32)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
  • AttributeRDMACapable (40-40)
pkg/host/host.go (1)
  • GetHelpers (148-151)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint
  • GitHub Check: Unit Tests
  • GitHub Check: Build Container
🔇 Additional comments (23)
pkg/devicestate/discovery_test.go (4)

81-82: LGTM! RDMA capability checks are consistently added.

The mock expectations for VerifyRDMACapability are properly added for each VF across all test scenarios, ensuring comprehensive coverage of the RDMA capability detection flow.

Also applies to: 154-154, 156-156, 205-205, 242-242, 420-420, 550-550


211-213: NUMA semantic change is intentional and well-documented.

The NUMA node default has been changed from 0 to -1 to indicate "not supported" when NUMA detection fails. This aligns with standard Linux conventions and is properly documented in the comments.


255-321: Excellent test coverage for RDMA-capable VFs.

The test thoroughly validates:

  • Mixed RDMA-capable and non-RDMA-capable VFs from the same PF
  • Correct attribute population for RDMA-capable devices
  • Proper handling of Mellanox-specific device IDs

323-357: Good error handling test for RDMA capability checks.

The test correctly verifies that RDMA capability check failures default to false rather than causing the discovery process to fail, ensuring graceful degradation.

pkg/devicestate/discovery.go (4)

79-84: NUMA handling updated correctly.

The change to default NUMA node to -1 (not supported) instead of a fabricated value is correct and aligns with standard Linux conventions. The comment clearly explains the semantics.


136-145: NUMA parsing with proper error handling.

The NUMA node parsing correctly handles errors by defaulting to -1 and includes informative logging. The use of ptr.To(numaNodeInt) is appropriate for the int pointer attribute.


151-157: RDMA capability check with graceful error handling.

The per-VF RDMA capability verification correctly defaults to false on error, ensuring the discovery process continues even if RDMA checks fail. The error is logged appropriately.


169-209: Attributes map construction is clean and maintainable.

The refactoring to build attributes in a map before creating the device resource improves readability and makes it easier to add new attributes in the future. All attributes including the new RDMA capability flag are properly populated.

pkg/devicestate/state_test.go (3)

113-151: Comprehensive test for RDMA device handling.

The test properly validates:

  • Device nodes for all RDMA character device types (uverbs, umad, issm, rdma_cm)
  • Environment variable naming and values
  • Correct mock expectations

The assertions verify both the count and content of device nodes and environment variables.


153-183: Good coverage for multiple RDMA devices scenario.

The test correctly validates that multiple RDMA devices (mlx5_0, mlx5_1) each generate their own device nodes and environment variables with proper naming conventions.


185-259: Excellent error path coverage.

All critical error scenarios are tested:

  • GetRDMADeviceForPCI failure
  • No RDMA devices found (empty list)
  • GetRDMACharDevices failure
  • No character devices found
  • Empty RDMA device name validation

Each test properly verifies error propagation and nil returns.

pkg/host/host_test.go (3)

593-662: Excellent PCI address validation coverage.

The tests thoroughly validate:

  • Valid RDMA device returns
  • Empty device lists returning nil
  • Empty PCI address validation
  • Multiple invalid format scenarios (no colons, too few parts, missing dot, too many parts)

The error messages are descriptive and helpful for debugging.


664-697: VerifyRDMACapability tests are comprehensive.

The tests correctly validate:

  • True when RDMA devices exist
  • False when device list is empty
  • Proper error handling

The distinction between "no capabilities" and "empty list" scenarios ensures robustness.


699-786: GetRDMACharDevices tests cover all critical scenarios.

Excellent test coverage including:

  • Complete set of character devices (uverbs, umad, issm, rdma_cm)
  • Partial device sets
  • Empty results returning nil
  • Multiple RDMA devices with different character devices
  • Empty device name validation

The tests properly use ContainElements for unordered matching and NotTo(Equal()) for differentiation.

go.mod (1)

6-6: v1.1.0 is the latest stable version with no known security vulnerabilities.

Verification confirms that v1.1.0 (released August 11, 2022) is the current and latest stable release of Mellanox/rdmamap. No security advisories or CVEs have been published for this Go library. The repository is lightly maintained with only 3 releases total, but the dependency is suitable for use.

pkg/host/mock/mock_host.go (1)

190-218: LGTM - Generated mock methods follow consistent patterns.

The new RDMA-related mock methods (GetRDMACharDevices, GetRDMADeviceForPCI, VerifyRDMACapability) are correctly generated and follow the same gomock patterns as existing methods in the file.

Also applies to: 377-391

pkg/devicestate/state.go (2)

218-226: LGTM - RDMA capability check integrates well with existing flow.

The RDMA handling is properly gated behind the AttributeRDMACapable boolean check, and errors propagate correctly to prevent pod startup without required RDMA devices.


267-338: LGTM - handleRDMADevice correctly enforces RDMA device availability.

The function properly:

  • Returns errors when RDMA devices or character devices are not found (addressing past review feedback)
  • Uses unique environment variable keys per RDMA device to avoid collisions
  • Builds CDI device nodes for all RDMA character devices
pkg/host/mock/mock_rdma_provider.go (1)

1-68: LGTM - Correctly generated mock for RdmaProvider interface.

The mock implementation follows standard gomock patterns and correctly matches the RdmaProvider interface defined in rdma_provider.go.

pkg/host/host.go (4)

113-117: LGTM - Interface additions for RDMA functionality.

The new RDMA methods are well-defined and follow the existing interface patterns.


121-131: LGTM - Host struct and initialization updated for RDMA support.

The rdmaProvider field and SetRdmaProvider method enable proper dependency injection for testing.

Also applies to: 153-157


786-814: LGTM - Robust PCI address validation.

The validation correctly enforces the DDDD:BB:DD.F format by checking for 3 colon-separated parts and a dot-separated device.function suffix.


816-830: LGTM - VerifyRDMACapability and GetRDMACharDevices implementations.

Both functions correctly delegate to the rdmaProvider and handle empty results appropriately.

Also applies to: 832-854

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
pkg/devicestate/state_test.go (1)

96-111: Unnecessary host.GetHelpers() call on line 100.

The call _ = host.GetHelpers() appears to force initialization of the global helpers, but the result is immediately discarded. Since origHelpers = host.Helpers on line 101 captures the current value anyway, this call may be redundant unless there's a specific initialization side effect needed.

🔎 Consider removing the unused call
 		BeforeEach(func() {
 			mockCtrl = gomock.NewController(GinkgoT())
 			mockHost = mock_host.NewMockInterface(mockCtrl)
 			// Save original helpers and replace with mock
-			_ = host.GetHelpers()
+			host.GetHelpers() // Ensure helpers are initialized before capturing
 			origHelpers = host.Helpers
 			host.Helpers = mockHost

Or if the initialization is truly not needed:

 		BeforeEach(func() {
 			mockCtrl = gomock.NewController(GinkgoT())
 			mockHost = mock_host.NewMockInterface(mockCtrl)
 			// Save original helpers and replace with mock
-			_ = host.GetHelpers()
 			origHelpers = host.Helpers
 			host.Helpers = mockHost
pkg/devicestate/state.go (1)

288-327: Potential duplicate device nodes for shared character devices like rdma_cm.

The /dev/infiniband/rdma_cm device is typically a single shared device across all RDMA devices on the system. When iterating over multiple RDMA devices, this same path could be added to deviceNodes multiple times, resulting in duplicate entries in the CDI spec.

While CDI may handle duplicates gracefully, consider deduplicating device nodes or tracking already-added paths.

🔎 Suggested deduplication approach
 func (s *Manager) handleRDMADevice(ctx context.Context, pciAddress, deviceName string) ([]*cdispec.DeviceNode, []string, error) {
 	logger := klog.FromContext(ctx).WithName("handleRDMADevice")
 	var deviceNodes []*cdispec.DeviceNode
 	var envs []string
+	addedDevicePaths := make(map[string]bool)
 
 	// ... existing code ...
 
 		// Add each character device to the CDI spec
 		for _, charDev := range charDevices {
+			// Avoid duplicate device nodes (e.g., rdma_cm is shared)
+			if addedDevicePaths[charDev] {
+				continue
+			}
+			addedDevicePaths[charDev] = true
+
 			deviceNodes = append(deviceNodes, &cdispec.DeviceNode{
 				Path:     charDev,
 				HostPath: charDev,
 				Type:     "c", // character device
 			})

Could you verify if /dev/infiniband/rdma_cm is indeed shared across RDMA devices and whether duplicate entries in CDI specs cause any runtime issues?

pkg/host/host.go (1)

835-854: Consider adjusting log levels for RDMA character device lookups.

The GetRDMACharDevices function logs at Info level (lines 841, 847, 851-852) for every call. Since this is invoked per RDMA device during container setup, it could generate significant log volume. Consider using V(2).Info for the entry log (line 841) to align with other similar methods.

🔎 Suggested log level adjustment
-	h.log.Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName)
+	h.log.V(2).Info("GetRDMACharDevices(): getting character devices for RDMA device", "rdmaDevice", rdmaDeviceName)

The "found character devices" log (line 851) at Info level is appropriate for visibility of successful RDMA setup.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ed2516 and 144ef0a.

📒 Files selected for processing (7)
  • pkg/devicestate/state.go
  • pkg/devicestate/state_test.go
  • pkg/host/host.go
  • pkg/host/host_test.go
  • pkg/host/mock/mock_host.go
  • pkg/host/mock/mock_rdma_provider.go
  • pkg/host/rdma_provider.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T11:30:04.648Z
Learnt from: rollandf
Repo: k8snetworkplumbingwg/dra-driver-sriov PR: 32
File: pkg/host/host.go:818-879
Timestamp: 2025-12-10T11:30:04.648Z
Learning: In Linux sysfs, /sys/class/infiniband/<device>/node_type exposes values like '1: CA' or '4: RNIC'. Implement parsing in Go by checking the numeric prefix (e.g., '1:' or '4:') rather than the text after. Use strings.HasPrefix(value, "1:") or strings.HasPrefix(value, "4:") to dispatch, and then map to a stable enum or constant (e.g., ChannelAdapter, RNIC). Avoid brittle text comparisons; validate that the prefix exists and handle unexpected values gracefully.

Applied to files:

  • pkg/host/mock/mock_host.go
  • pkg/host/host.go
  • pkg/host/host_test.go
  • pkg/host/rdma_provider.go
  • pkg/host/mock/mock_rdma_provider.go
🧬 Code graph analysis (5)
pkg/devicestate/state_test.go (2)
pkg/consts/consts.go (2)
  • AttributePciAddress (32-32)
  • AttributeRDMACapable (40-40)
pkg/host/host.go (3)
  • Interface (74-118)
  • GetHelpers (148-151)
  • Helpers (136-136)
pkg/devicestate/state.go (2)
pkg/consts/consts.go (1)
  • AttributeRDMACapable (40-40)
pkg/host/host.go (1)
  • GetHelpers (148-151)
pkg/host/mock/mock_host.go (1)
pkg/cni/mock/mock_cni.go (2)
  • MockInterface (23-27)
  • MockInterfaceMockRecorder (30-32)
pkg/host/host.go (1)
pkg/host/rdma_provider.go (1)
  • RdmaProvider (27-30)
pkg/host/host_test.go (2)
pkg/host/mock/mock_rdma_provider.go (2)
  • MockRdmaProvider (19-23)
  • NewMockRdmaProvider (31-35)
pkg/host/host.go (2)
  • Host (121-124)
  • NewHost (127-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Unit Tests
  • GitHub Check: Lint
  • GitHub Check: Build Container
🔇 Additional comments (9)
pkg/host/rdma_provider.go (1)

1-47: Clean abstraction for RDMA provider dependency injection.

The interface design is well-structured for testability. The wrapper pattern around rdmamap library allows easy mocking in unit tests while keeping the production implementation thin.

Minor observation: The interface methods don't return errors, which aligns with the underlying rdmamap library signatures. Callers should handle empty slice returns as "not found" conditions.

pkg/devicestate/state_test.go (1)

113-259: Comprehensive test coverage for handleRDMADevice.

The test suite thoroughly covers:

  • Single RDMA device with all character device types (uverbs, umad, issm, rdma_cm)
  • Multiple RDMA devices per PCI address
  • Error propagation from GetRDMADeviceForPCI and GetRDMACharDevices
  • Edge cases: empty device lists

The environment variable format validation (e.g., SRIOVNETWORK_device_1_mlx50_RDMA_UVERBS) confirms the sanitization logic works correctly with underscores and hyphens.

pkg/host/mock/mock_host.go (1)

190-218: Generated mock methods correctly match the interface.

The new RDMA-related mock methods (GetRDMACharDevices, GetRDMADeviceForPCI, VerifyRDMACapability) are correctly generated and align with the Interface definition in host.go. No issues with the generated code.

Also applies to: 378-391

pkg/devicestate/state.go (2)

218-226: RDMA integration point is correctly placed.

The RDMA handling is appropriately integrated after VFIO and vhost processing. The condition check correctly validates the RDMA capability attribute before invoking handleRDMADevice, and errors are propagated to prevent pod startup with incomplete RDMA configuration.


329-335: LGTM - Environment variable generation follows the established pattern.

The RDMA device name is correctly appended to environment variables after processing all character devices for each RDMA device, maintaining consistency with the naming convention.

pkg/host/host_test.go (1)

593-786: Excellent test coverage for RDMA device functions.

The test suite comprehensively validates:

  • Successful RDMA device and character device retrieval
  • Nil returns for non-RDMA-capable devices
  • Input validation for empty and malformed PCI addresses
  • PCI address format validation (DDDD:BB:DD.F)
  • RDMA capability verification logic

The tests correctly verify that empty results return nil rather than empty slices, matching the production implementation.

pkg/host/mock/mock_rdma_provider.go (1)

1-68: Generated mock correctly implements RdmaProvider interface.

The mock is properly generated from rdma_provider.go and follows gomock conventions. The methods correctly return []string without error, matching the interface definition.

pkg/host/host.go (2)

113-118: Clean interface extension and dependency injection support.

The RDMA methods are well-placed within the Interface definition, and SetRdmaProvider provides a clean mechanism for injecting mock providers in tests without modifying the global Helpers pattern.

Also applies to: 153-157


786-814: PCI address validation is correctly strict.

The validation now properly requires the full DDDD:BB:DD.F format with three colon-separated parts and a dot-separated device.function, addressing previous review feedback.

Returning nil, nil for empty results (lines 807-809) is appropriate since "no RDMA devices" is not an error condition at this layer - callers handle this based on their requirements.

}

// If device is RDMA capable, add RDMA character devices
if rdmaCapableAttr, ok := deviceInfo.Attributes[consts.AttributeRDMACapable]; ok && rdmaCapableAttr.BoolValue != nil && *rdmaCapableAttr.BoolValue {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i would move this long check into handleRDMADevice()

it can return early (with empty list), then always append deviceNodes and envs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return nil, fmt.Errorf("error handling RDMA device: %w", err)
}
deviceNodes = append(deviceNodes, rdmaDeviceNodes...)
envs = append(envs, rdmaEnvs...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general question about the environment variables we add per device.

with current implementation we have a bunch for each PCI device prepared which contains information about the various mounts (vfio, tun, vhost, rdma) as well as net-attach-def config name and the pci address.

should we have a single environment variable per device with a json map like we have in device plugin ?

e.g: SRIOVNETWORK_03_01_00_2_INFO={"rdma": {...}, "vfio": {....}, "vhost": {...}}

@SchSeba @rollandf thoughts ?

for this PR i think we need to keep the current style, then if we agree, align it all to a different format in a different PR.

logger.V(2).Info("Device is RDMA capable, adding RDMA character devices",
"device", pciAddress, "rdmaDevices", rdmaDevices)

for _, rdmaDevice := range rdmaDevices {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect to have exactly one RDMA device, i think we should check and fail if more than one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// Use RDMA device name in env var key to support multiple RDMA devices
rdmaDeviceSanitized := strings.ReplaceAll(rdmaDevice, "_", "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect to have 1 rdma device per PCI device so IMO we dont need it part of the env var name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Include RDMA device name to avoid collisions with multiple RDMA devices
switch {
case strings.Contains(charDev, "uverbs"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_%s_RDMA_UVERBS=%s", devicePrefix, rdmaDeviceSanitized, charDev))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: UVERBS -> UVERB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rollandf rollandf force-pushed the rdma-attach branch 2 times, most recently from c1923aa to e857b30 Compare February 12, 2026 12:11
@rollandf
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces RDMA device support, including discovery and injection into containers via CDI. The changes are well-structured, with a new handleRDMADevice function to encapsulate the logic and comprehensive unit tests to cover various scenarios. The integration with the host helper interface and the use of a dedicated RDMA provider are good design choices. I have one suggestion to improve the robustness of character device type detection.

Comment on lines 319 to 328
switch {
case strings.Contains(charDev, "uverbs"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev))
case strings.Contains(charDev, "umad"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev))
case strings.Contains(charDev, "issm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev))
case strings.Contains(charDev, "rdma_cm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using strings.Contains to identify character device types could be brittle. For example, if a device path were /dev/infiniband/my-uverbs-device, it would incorrectly match as a uverbs device. A more robust approach is to check the prefix of the device file's base name. For rdma_cm, an exact match is even better as it's a singleton device.

Note: This change requires importing the path/filepath package.

Suggested change
switch {
case strings.Contains(charDev, "uverbs"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev))
case strings.Contains(charDev, "umad"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev))
case strings.Contains(charDev, "issm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev))
case strings.Contains(charDev, "rdma_cm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev))
}
switch {
case strings.HasPrefix(filepath.Base(charDev), "uverbs"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev))
case strings.HasPrefix(filepath.Base(charDev), "umad"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev))
case strings.HasPrefix(filepath.Base(charDev), "issm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev))
case filepath.Base(charDev) == "rdma_cm":
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev))
}

@rollandf
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces RDMA device support, integrating with CDI for device injection into containers. The implementation is well-designed, with a new handleRDMADevice function in pkg/devicestate/state.go that cleanly encapsulates the logic for handling RDMA-capable devices. The host package is appropriately extended to discover RDMA character devices, maintaining a good separation of concerns. The changes are supported by a comprehensive suite of unit tests that cover various scenarios, including error conditions. Additionally, a minor improvement is suggested for mock accuracy in pkg/devicestate/state_test.go. Overall, this is a solid contribution that adds significant new functionality.

mockHost.EXPECT().GetRDMADevicesForPCI(pciAddress).Return([]string{rdmaDeviceName})

// Mock GetRDMACharDevices to return empty list
mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return([]string{}, nil)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The mock for GetRDMACharDevices returns an empty slice ([]string{}), but the actual implementation in pkg/host/host.go returns nil when no character devices are found. While the current test passes because len() is 0 for both an empty slice and a nil slice, it's best practice for mocks to accurately reflect the behavior of the functions they are replacing to prevent future breakages if the calling code changes to differentiate between them.

Suggested change
mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return([]string{}, nil)
mockHost.EXPECT().GetRDMACharDevices(rdmaDeviceName).Return(nil, nil)

Implements RDMA capability detection and device exposure using the
Mellanox rdmamap library. Adds discovery of RDMA devices (mlx5_*),
protocol detection (RoCE/InfiniBand/iWARP), and automatic injection
of character devices (/dev/infiniband/*) into containers via CDI.

Key changes:
- Add RdmaProvider interface with rdmamap integration
- Expose RDMA character devices (uverbs, umad, issm, rdma_cm)
- Set environment variables for device paths and names
- Add comprehensive unit tests for RDMA functionality

Matches feature parity with sriov-network-device-plugin for RDMA.

Tested on Mellanox ConnectX hardware with RoCE.

Signed-off-by: Fred Rolland <frolland@nvidia.com>
@rollandf
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces RDMA device support, including discovery and injection into containers via CDI. While no specific vulnerabilities or critical issues were found, there is a bug in how environment variables are created for RDMA character devices. This can lead to overwritten environment variables if multiple devices of the same type exist. It is suggested to aggregate these device paths into a comma-separated list, aligning with sriov-network-device-plugin behavior to ensure all device paths are correctly exposed.

Comment on lines +310 to +330
// Add each character device to the CDI spec
for _, charDev := range charDevices {
deviceNodes = append(deviceNodes, &cdispec.DeviceNode{
Path: charDev,
HostPath: charDev,
Type: "c", // character device
})

// Add environment variable for each character device type
// Include RDMA device name to avoid collisions with multiple RDMA devices
switch {
case strings.HasPrefix(filepath.Base(charDev), "uverbs"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERB=%s", devicePrefix, charDev))
case strings.HasPrefix(filepath.Base(charDev), "umad"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMAD=%s", devicePrefix, charDev))
case strings.HasPrefix(filepath.Base(charDev), "issm"):
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSM=%s", devicePrefix, charDev))
case filepath.Base(charDev) == "rdma_cm":
envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, charDev))
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation for creating environment variables for RDMA character devices has a bug. If there are multiple character devices of the same type (e.g., /dev/infiniband/uverbs0 and /dev/infiniband/uverbs1), this loop will generate multiple environment variables with the same key (e.g., SRIOVNETWORK_<PREFIX>_RDMA_UVERB). This will lead to unpredictable behavior as only one of these values will likely be available in the container's environment.

To ensure correctness and align with the feature parity goal with sriov-network-device-plugin, these paths should be aggregated into a single, comma-separated environment variable. Also, the variable names should be plural (e.g., UVERBS) for consistency.

Here is a suggested implementation:

	// Add each character device to the CDI spec
	uverbsPaths := []string{}
	umadPaths := []string{}
	issmPaths := []string{}
	rdmaCmPaths := []string{}

	for _, charDev := range charDevices {
		deviceNodes = append(deviceNodes, &cdispec.DeviceNode{
			Path:     charDev,
			HostPath: charDev,
			Type:     "c", // character device
		})

		// Group paths by type to create comma-separated env vars
		baseName := filepath.Base(charDev)
		switch {
		case strings.HasPrefix(baseName, "uverbs"):
			uverbsPaths = append(uverbsPaths, charDev)
		case strings.HasPrefix(baseName, "umad"):
			umadPaths = append(umadPaths, charDev)
		case strings.HasPrefix(baseName, "issm"):
			issmPaths = append(issmPaths, charDev)
		case baseName == "rdma_cm":
			rdmaCmPaths = append(rdmaCmPaths, charDev)
		}
	}

	// Add environment variables for each character device type.
	// This uses comma-separated lists to support multiple character devices of the same type,
	// matching feature parity with sriov-network-device-plugin.
	if len(uverbsPaths) > 0 {
		envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UVERBS=%s", devicePrefix, strings.Join(uverbsPaths, ",")))
	}
	if len(umadPaths) > 0 {
		envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_UMADS=%s", devicePrefix, strings.Join(umadPaths, ",")))
	}
	if len(issmPaths) > 0 {
		envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_ISSMS=%s", devicePrefix, strings.Join(issmPaths, ",")))
	}
	if len(rdmaCmPaths) > 0 {
		envs = append(envs, fmt.Sprintf("SRIOVNETWORK_%s_RDMA_CM=%s", devicePrefix, strings.Join(rdmaCmPaths, ",")))
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants